
ZigZag encoding adds quite a bit of overhead to writing/reading numbers. I was actually stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop involved: https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90

I can't find in the Protocol Buffers docs or the Avro docs, or work out for myself, what the advantage of scrambling numbers like that is. Why is it better to have positive and negative numbers alternate after encoding?

Why aren't they just written in little-endian or big-endian (network) order, which would only require reading them into memory and possibly reversing the byte order? What do we buy by paying with performance?
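For reference, here is roughly what the ZigZag mapping itself computes (a minimal sketch of my own, not the Avro code):

```java
// Sketch of the ZigZag mapping used by Avro and Protocol Buffers; the class
// and method names here are made up for illustration.
public class ZigZagDemo {
    // Maps signed values to unsigned ones so small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    static long encode(long n) {
        return (n << 1) ^ (n >> 63);   // arithmetic shift smears the sign bit
    }

    static long decode(long n) {
        return (n >>> 1) ^ -(n & 1);   // logical shift, then restore the sign
    }

    public static void main(String[] args) {
        for (long v : new long[] {0, -1, 1, -2, 2, -64, 63}) {
            System.out.println(v + " -> " + encode(v));
        }
    }
}
```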

Endrju
    If you're interested in something protobuf-like that doesn't do expensive varint encoding, see https://capnproto.org. It's faster but it does take more space on the wire. (Disclosure: I'm the author of Cap'n Proto and also the author of most of Google's open source Protobuf code.) – Kenton Varda Nov 28 '15 at 19:25
  • @KentonVarda Thanks for the info, I wasn't aware of Cap'n Proto. Added to my protocol toolbelt. – Endrju Nov 30 '15 at 08:05
  • Protocol buffers, at least as of version 2, lets you use fixed-size encoded integers (e.g. fixed32, sfixed64, etc.) if that makes more sense for your application. For example, if your values will be uniformly distributed across the possible range of values, then you want to use the fixed form rather than the variable form. That being said, it is very common for commonly used values to cluster close to zero, where this kind of simple variable-length encoding can save a lot of space. – jschultz410 Feb 22 '18 at 06:04
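To put rough numbers on that trade-off (my own illustration, not code from either library): for uniformly distributed 32-bit values the varint usually needs 5 bytes, so a fixed 4-byte encoding wins; for values clustered near zero it usually needs only 1-2 bytes.

```java
import java.util.Random;

// Hypothetical comparison of base-128 varint size vs. a fixed 4-byte encoding.
public class FixedVsVarint {
    // Number of bytes a base-128 varint needs for the given value,
    // treated as an unsigned 32-bit quantity.
    static int varintSize(int value) {
        long v = value & 0xFFFFFFFFL;
        int bytes = 1;
        while ((v & ~0x7FL) != 0) {   // more than 7 significant bits left?
            v >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 1_000_000;
        long uniformTotal = 0, smallTotal = 0;
        for (int i = 0; i < n; i++) {
            uniformTotal += varintSize(rnd.nextInt());        // uniform over all 32 bits
            smallTotal   += varintSize(rnd.nextInt(10_000));  // values clustered near zero
        }
        System.out.printf("uniform values: %.2f bytes on average (fixed32 is always 4)%n",
                (double) uniformTotal / n);
        System.out.printf("small values:   %.2f bytes on average%n",
                (double) smallTotal / n);
    }
}
```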

1 Answer


It is a variable-length, base-128 encoding: each byte carries 7 bits of the value, and the high bit of every byte except the last one is set to 1 to signal that more bytes follow. That is how the decoder can tell how many bytes were used to encode the value. Byte order is always little-endian (least significant 7-bit group first), regardless of the machine architecture.
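A minimal sketch of such a writer and reader (my own illustration with made-up method names, not the Avro or Protobuf code):

```java
import java.io.ByteArrayOutputStream;

// Base-128 varint: 7 bits of payload per byte, least significant group first,
// high bit set on every byte except the last one.
public class VarintSketch {
    static void writeVarint(long value, ByteArrayOutputStream out) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));   // 0x80 = "more bytes follow"
            value >>>= 7;
        }
        out.write((int) value);                         // final byte, high bit clear
    }

    static long readVarint(byte[] in) {
        long value = 0;
        int shift = 0, i = 0, b;
        do {
            b = in[i++] & 0xFF;
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);                      // stop when the high bit is clear
        return value;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(300, out);                          // the classic example: 0xAC 0x02
        byte[] bytes = out.toByteArray();
        System.out.println(bytes.length + " bytes, decodes to " + readVarint(bytes));
    }
}
```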

It is an encoding trick that permits writing only as many bytes as are needed to represent the value. So an 8-byte long with a value between -64 and 63 takes only one byte. That is the common case; the full range provided by long is very rarely used in practice.
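A rough check of that claim, applying the ZigZag mapping and then counting the 7-bit groups the varint needs (again, hypothetical helper names of my own):

```java
// Prints how many varint bytes a ZigZag-encoded long would take.
public class VarintSizeDemo {
    static int varintSize(long zigzagged) {
        int bytes = 1;
        while ((zigzagged & ~0x7FL) != 0) {   // more than 7 significant bits left?
            zigzagged >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long[] samples = {0, -1, 63, -64, 64, 1_000_000, Long.MAX_VALUE, Long.MIN_VALUE};
        for (long v : samples) {
            long zz = (v << 1) ^ (v >> 63);   // ZigZag: small magnitudes stay small
            System.out.println(v + " -> " + varintSize(zz) + " byte(s)");
        }
    }
}
```

Everything in [-64, 63] comes out as a single byte, while the extreme long values need the full 10 bytes.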

Packing the data tightly without the overhead of a gzip-style compression method was the design goal. The same scheme is also used in the .NET Framework. The processor overhead needed to encode/decode the value is inconsequential: it is already much lower than that of a compression scheme, and it is a very small fraction of the I/O cost.

Hans Passant
  • Thank you very much. I really appreciate your help. Now it makes perfect sense. I got lost because I started looking at the Java sources, which are [unnecessarily obfuscated in some places](https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/BinaryDecoder.java#L195). Gosh, does Java really need hand-crafted loop unrolling to work fast? – Endrju Nov 26 '15 at 11:35
  • @Endrju: The performance requirements for low-level library code are incredibly high, because such libraries have a very diverse collection of clients, some of which have tough performance requirements. Further, library code has a tendency to become a bottleneck, so optimizing libraries is often vital. Performance is frequently a primary concern when selecting libraries. – Brian Nov 27 '15 at 14:21
  • @Brian I know, I know, but... loop unrolling? Can't the Java JIT, after all those years and versions, do that equally well - or better?... – Endrju Nov 27 '15 at 18:31
  • Probably, though maybe not when targeting embedded systems. Optimistically, one hopes someone has actually tested the code and verified that it provides benefits. Realistically, it was probably verified to be both correct and sufficiently performant, then ignored. Unless they're already mucking with it for other reasons, skilled professionals are often hesitant to modify working code which meets performance and correctness goals. You might be able to analyze the change history to figure out why that loop was unrolled; maybe it was done in response to a benchmark? – Brian Nov 27 '15 at 19:10
  • The second sentence of this answer is wrong. The most significant bit of each octet is reserved to indicate whether or not the encoding continues. The exception to that is the ninth octet, where all 8 bits represent part of the value. – jschultz410 Feb 22 '18 at 06:00
  • One quick comment which frequently gets lost -- quite often the amount of time needed to transmit / transfer a byte greatly exceeds the processing time needed for weird encodings. Even at 1 Gb/s wire speed a modern processor running at 2+ GHz is going to "win" by spending cycles on simple encodings rather than transmitting extra bytes. There are environments, such as IoT and BLE, where bandwidth is so constrained, and the processor relatively more performant, that bizarre encodings are a huge win. – Julie in Austin Jul 25 '18 at 18:55