From 019e5a77f6c03af4ba5173eef6616f5197add26d Mon Sep 17 00:00:00 2001
From: Sven Gestegard Robertz <sven.robertz@cs.lth.se>
Date: Tue, 17 Feb 2015 10:11:49 +0100
Subject: [PATCH] started comparing with Avro and EDN

---
 doc/tech_report.tex | 107 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/doc/tech_report.tex b/doc/tech_report.tex
index c6e9cef..93a9049 100644
--- a/doc/tech_report.tex
+++ b/doc/tech_report.tex
@@ -539,6 +539,113 @@ Java primitive types. However, it is unlikely that the entire range is actually
 way of supporting the common cases is to include run-time checks for overflow in
 the Java encoders and decoders.
 
+\section{Related work}
+
+Two in-band self-describing communication protocols are Apache
+Avro~\cite{avro} and EDN, the extensible data notation developed for
+Clojure and Datomic~\cite{EDN}.
+
+EDN encodes \emph{values} as UTF-8 strings. The documentation says
+``edn is a system for the conveyance of values. It is not a type system,
+and has no schemas.'' That said, it is \emph{extensible} in the sense
+that it has a special \emph{dispatch character}, \verb+#+, which can
+be used to add a \emph{tag} to a value. A tag indicates a semantic
+interpretation of a value, which allows the reader to support
+handlers for specific tags, enabling functionality similar to that of
+LabComm.
+
+\subsection{Apache Avro}
+
+Apache Avro is similar to LabComm in that it has a textual language
+for declaring data, a binary protocol for transmitting data, and code
+generation for several languages.
+
+\subsubsection*{Data types}
+
+The table lists the Avro type names together with the encoding of the
+corresponding type in LabComm and in Avro:
+
+\begin{tabular}{|l|c|c|}
+\hline
+ Type & LabComm & Avro \\
+ \hline Primitive types \\ \hline
+
+int & 4 bytes & varint \\
+long & 8 bytes & varint \\
+float & 4 bytes & 4 bytes \\
+double & 8 bytes & 8 bytes \\
+string & varint + utf8[] & varint + utf8[] \\
+bytes & varint + byte[] & varint + byte[]\\
+
+ \hline Complex types \\ \hline
+
+struct/record & concat of fields & concat of fields \\
+arrays & varIdx[] : elements & block[] \\
+map & n/a & block[] \\
+union & n/a & (varint idx) : value \\
+fixed & byte[n] & the number of bytes declared in
+the schema\\
+\hline
+\end{tabular}
+
+ where
+
+\begin{verbatim}
+  block ::= (varint count) : elem[count]    [*1]
+            count == 0 --> no more blocks
+
+
+[*1] for arrays, count == 0 --> end of array
+     if count < 0, there are |count| elements
+     preceded by a varint block_size to allow
+     fast skipping
+\end{verbatim}
+
+In maps, keys are strings, and values are encoded according to the schema.
+
+In unions, the index indicates the kind of value and the
+value is encoded according to the schema.
+
+Note that the Avro data type \verb+bytes+ corresponds to the
+LabComm declaration \verb+byte[_]+, i.e. a variable-length byte array.
+
+\subsubsection*{The wire protocol}
+
+\begin{tabular}{|l|c|c|}
+ \hline
+ What & LabComm & Avro \\ \hline
+ Data description & Binary signature & JSON schema \\
+ Signature sent only once & possible & possible (stateful) \\
+ Signature sent with each sample & possible & possible (stateless) \\
+ Data encoding & binary & binary \\
+ \hline
+\end{tabular}
+
+
+Both Avro and LabComm use varints when encoding data. They are similar in
+that both send a sequence of bytes containing 7-bit chunks (with the
+eighth bit signalling that more chunks follow), but they differ in range,
+endianness and signedness.
+ +\begin{verbatim} + LabComm Avro + unsigned 32 bit signed zig-zag coding + most significant chunk least significant chunk + first first + + 0 -> 00 0 -> 00 + 1 -> 01 -1 -> 01 + 2 -> 02 1 -> 02 + ... -2 -> 03 + 2 -> 04 + ... + 127 -> 7f -64 -> 7f + 128 -> 81 00 64 -> 80 01 + 129 -> 81 01 -65 -> 81 01 + 130 -> 81 02 65 -> 82 01 + ... ... +\end{verbatim} + \bibliography{refs}{} \bibliographystyle{plain} -- GitLab