Skip to content
Snippets Groups Projects
Select Git revision
  • 1c11e3c98d7d5d24305b9b7dea42075af62de0a0
  • master default
  • labcomm2014
  • labcomm2006
  • python_sig_hash
  • typedefs
  • anders.blomdell
  • typeref
  • pragma
  • compiler-refactoring
  • labcomm2013
  • v2014.6
  • v2015.0
  • v2014.5
  • v2014.4
  • v2006.0
  • v2014.3
  • v2014.2
  • v2014.1
  • v2014.0
  • v2013.0
21 results

tech_report.tex

Blame
  • tech_report.tex 21.16 KiB
    % *** en embryo of a technical report describing the labcomm design rationale and implementation ***
    
    \documentclass[a4paper]{article}
    \usepackage{listings}
    %\usepackage{verbatims}
    %\usepackage{todo}
    
    \begin{document}
    \title{Labcomm tech report}
    \author{Anders Blomdell and Sven Gesteg\aa{}rd Robertz }
    \date{draft, \today}
    
    \maketitle
    
    \begin{abstract}
    
    LabComm is a binary protocol suitable for transmitting and storing samples of
    process data. It is self-describing and independent of programming language,
    processor, and network used (e.g., byte order, etc).  It is primarily intended
    for situations where the overhead of communication has to be kept at a minimum,
    hence LabComm only requires one-way communication to operate. The one-way
    operation also has the added benefit of making LabComm suitable as a storage
    format.
    
    LabComm provides self-describing channels, as communication starts with the
    transmission of an encoded description of all possible sample types that can
    occur, followed by any number of actual samples in any order the sending
    application sees fit.
    
    The LabComm system is based on a binary protocol and
    and a compiler that generates encoder/decoder routines for popular languages
    including C, Java, and Python. There is also a standard library for the
    languages supported by the compiler, containing generic routines for
    encoding and decoding types and samples, and interaction with
    application code.
    
    The LabComm compiler accepts type and sample declarations in a small language
    that is similar to C or Java type-declarations.
    \end{abstract}
    \section{Introduction}
    
    %[[http://rfc.net/rfc1057.html|Sun RPC]]
    %[[http://asn1.org|ASN1]].
    
    LabComm has got it's inspiration from Sun RPC~\cite{SunRPC}
    and ASN1~\cite{ASN1}. LabComm is primarily intended for situations
    where the overhead of communication has to be kept at a minimum, hence LabComm
    only requires one-way communication to operate. The one-way operation also has
    the added benefit of making LabComm suitable as a storage format.
    
    Two-way comminication adds complexity, in particular for hand-shaking
    during establishment of connections, and the LabComm library provides
    support for (for instance) avoiding deadlocks during such hand-shaking.
    
    \pagebreak
    \section{Communication model}
    
    LabComm provides self-describing communication channels, by always transmitting
    a machine readable description of the data before actual data is
    sent\footnote{Sometimes referred to as \emph{in-band} self-describing}.
    Therefore, communication on a LabComm channel has two phases
    
    \begin{enumerate}
    \item the transmission of signatures (an encoded description including data
    types and names, see appendix~\ref{sec:ProtocolGrammar} for details) for all sample types
    that can be sent on the channel
    \item the transmission of any number of actual samples in any order
    \end{enumerate}
    
    During operation, LabComm will ensure (i.e., monitor) that a communication
    channel is fully configured, meaning that both ends agree on what messages that
    may be passed over that channel.  If an unregistered sample type is sent or
    received, the LabComm encoder or decoder will detect it and take action.
    In more dynamic applications, it is possible to reconfigure a channel in order to add,
    remove, or change the set of registered sample types. This is discussed
    in Section~\ref{sec:reconfig}.
    
    The roles in setting up, and maintaining, the configuration of a channel are as follows:
    
    \paragraph{The application software} (or higher-level protocol) is required to
    
    \begin{itemize}
    \item register all samples to be sent on a channel with the encoder
    \item register handlers for all samples to be received  on a channel with the decoder
    \end{itemize}
    
    \paragraph{The transmitter (encoder)}
    
    \begin{itemize}
     \item ensures that the signature of a sample is transmitted on the channel before samples are
           written to that channel
    \end{itemize}
    
    \paragraph{The receiver (decoder)}
    
    \begin{itemize}
     \item checks, for each signature, that the application has registered a handler for that sample type
     \item if an unhandled signature is received, pauses the channel and informs the application
    \end{itemize}
    
    The fundamental communication model applies to all LabComm channels and
    deals with the individual unidirectional channels. In addition to that,
    the labcomm libraries support the implementation of higher-level
    handshaking and establishment of bidirectional channels both through
    means of interacting with the underlying transport layer (e.g., for
    marking packets containing signatures as \emph{important}, for
    transports that handle resending of dropped packets selectively), or
    requesting retransmission of signatures.
    
    In order to be both lean and generic, LabComm does not provide a
    complete protocol for establishing and negotiating bidirectional
    channels, but does provide support for building such protocols on top
    of LabComm.
    \subsection{Reconfiguration}
    \label{sec:reconfig}
    
    The fundamental communication model can be generalized to the life-cycle
    of a concrete communication channel, including the transport layer,
    between two end-points. Then, the communication phases are
    \begin{enumerate}
      \item \emph{Establishment} of communication channel at the transport layer
      \item \emph{Configuration} of the LabComm channel (registration of sample
        types)
      \item \emph{Operation} (transmission of samples)
    \end{enumerate}
    where it is possible to \emph{reconfigure} a channel by transitioning
    back from phase 3 to phase 2. That allows dynamic behaviour, as a sample
    type can be added or redefined at run-time. It also facilitates error
    handling in two-way channels.
    
    One example of this, more dynamic, view of a labcomm channel is that the
    action taken when an unregistered sample is sent or received is to
    revert back to the configuration phase and redo the handshaking to
    ensure that both sides agree on the set of sample types (i.e.,
    signatures) that are currently configured for the channel.
    
    From the system perspective, the LabComm protocol is involved in
    phases 2 and 3. The establishement of the \emph{transport-layer}
    channels is left to external application code. In the Labcomm library,
    that application code is connected to the LabComm routines through
    the \emph{reader} and \emph{writer} interfaces,
    with default implementations for sockets or file descriptors (i.e.,
    files and streams).
    
    
    \section{The Labcomm language}
    
    The LabComm language is used to describe data types, and from such data
    descriptions, the compiler generates code for encoding and decoding
    samples. The language is quite similar to C struct declarations, with
    some exceptions. We will now introduce the language through a set of
    examples.
    
    These examples do not cover the entire language
    specification (see appendix~\ref{sec:LanguageGrammar} for the complete
    grammar), but serve as a gentle introduction to the LabComm
    language covering most common use-cases.
    
    \subsection{Primitive types}
    
    \begin{verbatim}
      sample boolean a_boolean;
      sample byte a_byte;
      sample short a_short;
      sample int an_int;
      sample long a_long;
      sample float a_float;
      sample double a_double;
      sample string a_string;
    \end{verbatim}
    
    \subsection{Arrays}
    
    \begin{verbatim}
      sample int fixed_array[3];
      sample int variable_array[_];                // Note 1
      sample int fixed_array_of_array[3][4];       // Note 2
      sample int fixed_rectangular_array[3, 4];    // Note 2
      sample int variable_array_of_array[_][_];    // Notes 1 & 2
      sample int variable_rectangular_array[_, _]; // Notes 1 & 2
    \end{verbatim}
    
    \begin{enumerate}
    \item In contrast to C, LabComm supports both fixed and variable (denoted
    by~\verb+_+) sized arrays.
    
    \item In contrast to Java, LabComm supports multidimensional arrays and not
    only arrays of arrays.
    
    \end{enumerate}
    
    \subsection{Structures}
    
    \begin{verbatim}
      sample struct {
        int an_int_field;
        double a_double_field;
      } a_struct;
    \end{verbatim}
    
    \section{User defined types}
    
    \begin{verbatim}
      typedef struct {
        int field_1;
        byte field_2;
      } user_type[_];
      sample user_type a_user_type_instance;
      sample user_type another_user_type_instance;
    \end{verbatim}
    
    \section{The LabComm system}
    
    The LabComm system consists of a compiler for generating code from the data
    descriptions, and libraries providing LabComm communication facilities in,
    currently, C, Java, Python, C\#, and RAPID\footnote{excluding variable
    size samples, as RAPID has limited support for dynamic memory allocation}.
    
    
    \subsection{The LabComm compiler}
    
    The LabComm compiler generates code for the declared samples, including marshalling and
    demarshalling code, in the supported target languages.
    
    The compiler itself is implemented in Java using the JastAdd~\cite{jastadd} compiler compiler.
    
    \subsection{The LabComm library}
    
    The LabComm libraries contain functionality for the end-to-end transmission
    of samples. They are divided into two layers, where the upper layer implements
    the general encoding and decoding of samples, and the lower one deals with
    the transmission of the encoded byte stream on a particular transport layer.
    
    Thus, the LabComm communication stack looks like this:
    \begin{figure}[h!]
    \begin{verbatim}
        _______________________
        |     Application     |
        +---------------------+
        | encoder  | decoder  |    to/from labcomm encoded byte stream
        +----------+----------+
        | writer   | reader   |    transmit byte stream over particular transport
        +----------+----------+
        | transport layer / OS|
        +---------------------+
    \end{verbatim}
    \end{figure}
    \subsubsection{LabComm actions}
    
    (similar to ioctl())
    The encoder/writer and decoder/reader interfaces consist of a set of actions
    
    One example of this is that there is a a separate writer action for
    transmitting signatures, allowing the writer to treat a signature differently
    from encoded samples, e.g., to allow handshaking during channel setup.
    
    User actions allow the application or a higher level
    protocol to communicate with the underlying transport layer through the LabComm
    encoder.
    
    One example is reliable communication, which is controlled from the application
    but needs to be implemented for each transport at at the reader/writer level.
    (Or not, e.g., TCP)
    
    \section{LabComm is not...}
    
    \begin{itemize}
    \item a protocol for two-way connections
    \item intrinsically supporting reliable communication
    \item providing semantic service-descriptions
    \end{itemize}
    
    But
    
    \begin{itemize}
    \item it is suitable for the individual channels of a structured connection
    \item the user action mechanism allows using feature of different transport layers
      through labcomm (i.e., it allows encapsulation of the transport layer)
    \item the names of samples can be chosen and mapped according to a suitable taxonomy or ontology
    \end{itemize}
    
    
    
    \section{Example and its encoding}
    
    With the following `example.lc` file:
    
    \lstinputlisting[basicstyle=\footnotesize\ttfamily]{../examples/wiki_example/example.lc}
    and this \verb+example_encoder.c+ file
    \lstinputlisting[basicstyle=\footnotesize\ttfamily,language=C]{../examples/wiki_example/example_encoder.c}
    
    \newpage
    
    Running \verb+./example_encoder one two+, will yield the following result in example.encoded:
    \begin{verbatim}
    00000000  01 0c 0b 4c 61 62 43 6f  6d 6d 32 30 31 34 02 30 |...LabComm2014.0|
    00000010  40 0b 6c 6f 67 5f 6d 65  73 73 61 67 65 22 11 02 |@.log_message"..|
    00000020  08 73 65 71 75 65 6e 63  65 23 04 6c 69 6e 65 10 |.sequence#.line.|
    00000030  01 00 11 02 04 6c 61 73  74 20 04 64 61 74 61 27 |.....last .data'|
    00000040  02 08 41 04 64 61 74 61  01 25 40 04 00 00 00 01 |..A.data.%@.....|
    00000050  00 40 09 00 00 00 02 01  01 03 6f 6e 65 40 0e 00 |.@........one@..|
    00000060  00 00 03 02 00 03 6f 6e  65 01 03 74 77 6f 41 04 |......one..twoA.|
    00000070  00 00 00 00 41 04 3f 80  00 00 41 04 40 00 00 00 |....A.?...A.@...|
    00000080
    \end{verbatim}
    
    i.e.,
    \begin{verbatim}
    <version> <length: 12> <string: <len: 11> <"LabComm2014">>
    <sample_decl> <length: 48 
                  <user_id: 0x40> 
                  <string: <len: 11> <"log_message">
      <signature_length: 34>
      <struct_decl:
        <number_of_fields: 2>
        <string: <len: 8> <"sequence"> <type: <integer_type>>
        <string: <len: 4> <"line">> <type: <array_decl
          <number_indices: 1> <variable_index>
          <type: <struct_decl:
            <number_of_fields:2>
            <string: <len: 4> <"last">> <type: <boolean_type>>
            <string: <len: 4> <"data">> <type: <string_type>>
          >>
       >
    >
    <sample_decl> <length: 8> <user_id: 0x41> <string: <len: 4> <"data">>
      <signature_length: 1> <float_type>
    <sample_data> <user_id: 40> <length: 4>  <packed_sample_data>
    <sample_data> <user_id: 40> <length: 9>  <packed_sample_data>
    <sample_data> <user_id: 40> <length: 14>  <packed_sample_data>
    \end{verbatim}
    
    \section{Type and sample declarations}
    
    LabComm has two constructs for declaring sample types, \emph{sample
    declarations} and \emph{type declarations}. A sample declaration is used
    for the concrete sample types that may be transmitted, and is always
    encoded as a \emph{flattened} signature. That means that a sample
    containing user types, like
    
    \begin{verbatim}
    typedef struct {
      int x;
      int y;
    } point;
    
    sample struct {
      point start;
      point end;
    } line;
    \end{verbatim}
    
    is flattened to 
    
    \begin{verbatim}
    sample struct {
      struct {
        int x;
        int y;
      } start;
      struct {
        int x;
        int y;
      } end;
    } line;
    \end{verbatim}
    
    Sample declarations are always sent, and is the fundamental identity of
    a type in LabComm. 
    
    Type declarations is the hierarchical counterpart to sample
    declarations: here, fields of user types are encoded as a reference to
    the type instead of being flattened. As the flattened sample decl is the
    fundamental identity of a type, type declarations can be regarded as
    meta-data, describing the internal structure of a sample. They are
    intended to be read by higher-level software and human system developers
    and integrators.
    
    Sample declarations and type declarations have separate name-spaces in
    the sense that the numbers assigned to them by a labcomm encoder 
    come from two independent number series. To identify which
    \verb+TYPE_DECL+ a particular \verb+SAMPLE_DECL+ corresponds to, the
    \verb+TYPE_BINDING+ packet is used.
    
    \subsection{Example}
    
    The labcomm declaration
    \lstinputlisting[basicstyle=\footnotesize\ttfamily]{../examples/user_types/test.lc}
    can be is encoded as
    \begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
    TYPE_DECL 0x40 "coord" <int> val
    TYPE_DECL 0x41 "point" <struct> <2 fields> 
                                    "x" <type: 0x40> 
                                    "y" <type: 0x40>
    TYPE_DECL 0x42 "line" <struct> <2 fields> 
                                    "start" <type: 0x41> 
                                    "end" <type: 0x41>
    TYPE_DECL 0x43 "foo" <struct> <3 fields> 
                                    "a" <int> 
                                    "b" <int> 
                                    "c" <boolean>
    TYPE_DECL 0x44 "twolines" <struct> <3 fields> 
                                    "l1" <type:0x42> 
                                    "l2" <type:0x42> 
                                    "f" <type:0x43>
    
    SAMPLE_DECL 0x40 "twolines" <flat signature>
    
    TYPE_BINDING 0x40 0x44
    \end{lstlisting}
    
    Note that the id 0x40 is used both for the \verb+TYPE_DECL+ of
    \verb+coord+ and the \verb+SAMPLE_DECL+ of \verb+twoline+, and that the
    \verb+TYPE_BINDING+ binds the sample id \verb+0x40+ to the type id
    \verb+0x44+.
    
    \subsection{Run-time behaviour}
    
    When a sample type is registered on an encoder, a \verb+SAMPLE_DECL+
    (i.e., the flat signature) is always generated on that encoder channel.
    
    If the sample depends on user types (i.e., typedefs), \verb+TYPE_DECL+
    packets are encoded, recursively, for the dependent types and a 
    corresponding \verb+TYPE_BINDING+ is encoded.
    
    If a \verb+TYPE_DECL+ is included via multiple sample types, or
    dependency paths, an encoder may choose to only encode it once, but is
    not required to do so. However, if multiple \verb+TYPE_DECL+ packets are
    sent for the same \verb+typedef+, the encoder must use the same
    \verb+type_id+.
    
    
    
    \section{Ideas/Discussion}:
    
    The labcomm language is more expressive than its target languages regarding data types.
    E.g., labcomm can declare both arrays of arrays and matries where Java only has arrays of arrays
    In the generated Java code, a labcomm matrix is implemented as an array of arrays.
    
    Another case (not yet included) is unsigned types, which Java doesn't have. If we include
    unsigned long in labcomm, that has a larger range of values than is possible to express using
    Java primitive types. However, it is unlikely that the entire range is actually used, so one
    way of supporting the common cases is to include run-time checks for overflow in the Java encoders
    and decoders.
    
    \bibliography{refs}{}
    \bibliographystyle{plain}
    
    \appendix
    \newpage
    
    \section{The LabComm language}
    \label{sec:LanguageGrammar}
    
    \subsection{Abstract syntax}
    \begin{verbatim}
    Program ::= Decl*;
    
    abstract Decl ::= Type <Name:String>;
    TypeDecl : Decl;
    SampleDecl : Decl;
    
    Field ::= Type <Name:String>;
    
    abstract Type;
    VoidType          : Type;
    PrimType          : Type ::= <Name:String> <Token:int>;
    UserType          : Type ::= <Name:String>;
    StructType        : Type ::= Field*;
    ParseArrayType    : Type ::= Type Dim*;
    abstract ArrayType : Type ::= Type Exp*;
    VariableArrayType : ArrayType;
    FixedArrayType    : ArrayType;
    
    Dim ::= Exp*;
    
    abstract Exp;
    IntegerLiteral : Exp ::= <Value:String>;
    VariableSize : Exp;
    \end{verbatim}
    
    \newpage
    \section{The LabComm protocol}
    \label{sec:ProtocolGrammar}
    
    Each LabComm2014 packet has the layout
    \begin{verbatim}
    <id> <length> <data...>
    \end{verbatim}
    where \verb+length+ is the number of bytes of the \verb+data+ part
    (i.e., excluding the \verb+id+ and \verb+length+ fields), and 
    the \verb+id+ gives the layout of the \verb+data+ part as defined 
    in \ref{sec:ConcreteGrammar}
    \subsection{Data encoding}
    LabComm primitive types are encoded as fixed width values, sent in
    network order.  Type fields, user IDs, number of indices and lengths are
    sent in a variable length (\emph{varint}) form.  A varint integer value
    is sent as a sequence of bytes where the lower seven bits contain a
    chunk of the actual number and the high bit indicates if more chunks
    follow. The sequence of chunks are sent with the least significant chunk
    first.  
    
    The built-in data types are encoded as follows:
    \begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
    ||Type      ||Encoding/Size                                      ||
    ||----------||---------------------------------------------------||
    ||boolean   ||  8 bits                                           ||
    ||byte      ||  8 bits                                           ||
    ||short     || 16 bits                                           ||
    ||integer   || 32 bits                                           ||
    ||long      || 64 bits                                           ||
    ||float     || 32 bits                                           ||
    ||double    || 64 bits                                           ||
    ||string    || length (varint), followed by UTF8 encoded string  ||
    ||array     || each variable index (varint),                     ||
    ||          || followed by encoded elements                      ||
    ||struct    || concatenation of encoding of each element         ||
    ||          || in declaration order                              ||
    \end{lstlisting}
    
    \subsection{Protocol grammar}
    \label{sec:ConcreteGrammar}
    \begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
    <packet>       := <id> <length> ( <version>      | 
                                      <type_decl>    | 
                                      <sample_decl>  |
                                      <type_binding> |
                                      <sample_data> )
    <version>      := <string>
    <sample_decl>  := <sample_id> <string> <type>
    <type_decl>    := <type_id> <string> <type>
    <type_binding> := <sample_id> <type_id>
    <user_id>      := 0x40..0xffffffff  
    <sample_id> : <user_id>
    <type_id>   : <user_id>
    <string>       := <string_length> <char>*
    <string_length>:= 0x00..0xffffffff  
    <char>         := any UTF-8 char
    <type>         := <length> ( <basic_type> | <array_decl> | <struct_decl> | <type_id> )
    <basic_type>   := ( <boolean_type> | <byte_type> | <short_type> |
                      <integer_type> | <long_type> | <float_type> |
                      <double_type> | <string_type> )
    <boolean_type> := 0x20 
    <byte_type>    := 0x21 
    <short_type>   := 0x22 
    <integer_type> := 0x23 
    <long_type>    := 0x24 
    <float_type>   := 0x25 
    <double_type>  := 0x26 
    <string_type>  := 0x27 
    <array_decl>   := 0x10  <number_of_indices> <indices> <type>
    <number_of_indices> := 0x00..0xffffffff  
    <indices>      := ( <variable_index> | <fixed_index> )*
    <variable_index> := 0x00  
    <fixed_index>  := 0x01..0xffffffff  
    <struct_decl>  := 0x11  <number_of_fields> <field>*
    <number_of_fields> := 0x00..0xffffffff  
    <field>        := <string> <type>
    <sample_data>  := packed sample data sent in network order, with
                      primitive type elements encoded according to
                      the sizes above
    \end{lstlisting}
    where the \verb+<id>+ in \verb+<packet>+ signals the type of payload,
    and may be either a \verb+<sample_id>+ or a system packet id.
    The labcomm sytem packet ids are:
    \begin{lstlisting}[basicstyle=\footnotesize\ttfamily]
    version:      0x01 
    sample_decl:  0x02 
    type_decl:    0x03 
    type_binding: 0x04          
    \end{lstlisting}
    Note that since the signature transmitted in a \verb+<sample_def>+ is
    flattened, the \verb+<type>+ transmitted in a \verb+<sample_def>+ may
    not contain any \verb+<type_id>+ fields.
    \end{document}