I am working with some data in Node.js that I need to encode in a binary format. Internally I use Node.js Buffers for this, but when I serialize the data, which encoding is the best to use? I am currently using the 'binary' encoding, but this is marked as deprecated in the documentation. Is there a better choice? I am looking to use as little space as possible in my representation.
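
For reference, the round trip I mean looks roughly like this (a simplified sketch, not my actual code):

```js
// Simplified sketch: turn a Buffer into a string for storage, then back again.
// 'binary' (an alias for 'latin1') is the deprecated encoding in question.
const buf = Buffer.from([0x00, 0x7f, 0x80, 0xff]);

const asBinary = buf.toString('binary'); // deprecated name, same as 'latin1'
const asBase64 = buf.toString('base64');
const asHex    = buf.toString('hex');

// Decoding must use the same encoding that was used to encode.
const restored = Buffer.from(asBase64, 'base64');
console.log(buf.equals(restored)); // true
```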

Max Ehrlich
  • Explain what you mean by serializing binary data. If you need to transmit the data in a text-based protocol, what encoding is used in that protocol? Do you want to include the buffers in JSON? If you "serialize" to disk, just `fs.writeFile` the `Buffer`. – windm Nov 03 '14 at 17:46
  • I am serializing to a redis database, which can only handle strings, though they are 'binary safe'. – Max Ehrlich Nov 03 '14 at 17:54
  • http://stackoverflow.com/questions/20732332/how-to-store-a-binary-object-in-redis-using-node – windm Nov 03 '14 at 18:19
  • Yeah, I've read that. My question is asking what the most space-efficient encoding to use is. Should I assume from your answer that it is base64? – Max Ehrlich Nov 03 '14 at 18:34

1 Answer

In an effort to get a thorough answer to this, I ran a few tests using my data. My data consists of a set of 4096-element number arrays. I used two set sizes, one with 100 arrays and the other with 5000 arrays. These were serialized to a redis cache as lists, with each element of the redis list holding a single serialized array. The size of the key redis was using for the list was then read off with `DEBUG OBJECT`, examining the `serializedlength` field. Results are summarized in the tables below.
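
A stripped-down sketch of this setup looks roughly like the following. The Float64 packing, the key name, and the callback-style node_redis client are illustrative assumptions on my part, not necessarily what was actually run; the JSON row presumably stringifies the array directly rather than going through a Buffer, and the size is then read off with `DEBUG OBJECT <key>` from redis-cli:

```js
// Illustrative sketch of the measurement harness (assumptions noted above).
var redis = require('redis');
var client = redis.createClient();

// Pack one 4096-element number array into a Buffer (Float64 layout assumed here).
function makeBuffer() {
  var buf = Buffer.alloc(4096 * 8);
  for (var i = 0; i < 4096; i++) {
    buf.writeDoubleLE(Math.random() * 1000, i * 8);
  }
  return buf;
}

var encoding = 'base64';          // swap for 'binary', 'hex', ...
var key = 'enctest:' + encoding;  // hypothetical key name
var remaining = 100;              // 100-sample run; use 5000 for the larger test

for (var n = 0; n < 100; n++) {
  client.rpush(key, makeBuffer().toString(encoding), function (err) {
    if (err) throw err;
    if (--remaining === 0) {
      // Then, from redis-cli:  DEBUG OBJECT enctest:base64
      // and read the serializedlength field.
      client.quit();
    }
  });
}
```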

100 samples

encoding     size (bytes)
base64          4,177,241
binary          4,162,398
hex             4,669,965
JSON            2,271,670
utf16le*        4,543,605
utf8*           3,640,132
ascii*          2,929,850


5000 samples

encoding     size (bytes)
base64        213,317,603
binary        213,433,150
hex           238,609,493
JSON          115,733,172
utf16le*      232,032,313
utf8*         185,279,730
ascii*        149,860,001

* text encodings were provided for completeness and should not be used on real data

Some things to note about these results:

  • JSON encoding won both tests, and by a large margin. This seems odd to me, since it expands the data with extra punctuation (brackets, commas); I would love to know the reason for this.
  • Memory consumption for the buffer encodings should be O(n·d), where n is the number of elements per array and d is the number of arrays, with the constant factor set by the fixed byte width of each element. Memory consumption for the JSON case, however, should be O(c·n·d), where c is the average number of digits in the numbers, so JSON can come out ahead whenever a number prints in fewer characters than its encoded bytes cost (a rough sanity check of this arithmetic is sketched after this list).
  • binary encoding beats base64 encoding on the 100 sample set but not the 5000 sample set
  • The text encodings (utf16le, utf8, ascii, all marked with a *) should not be used for real data and were included only for completeness' sake. utf8 actually crashed during deserialization, and ascii is known to strip the high bit of every byte [1].
  • The field used for these tests (serializedlength) may be a poor indicator of the actual size of a key [2]. However, since all we care about here is the relationship between the sizes of the different encodings, these results should still be useful.
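
As a rough sanity check of the digits-versus-fixed-byte-width point above, the following sketch compares the encoded character counts of a single array. The Float64 packing and the 0–65535 value range are assumptions and may not match the original data, but under them JSON comes out smallest and hex largest, the same shape as the tables above:

```js
// Rough size check for one array (character counts; Float64 packing assumed).
var arr = [];
for (var i = 0; i < 4096; i++) {
  arr.push(Math.round(Math.random() * 65535)); // 1-5 digit integers (assumption)
}

var buf = Buffer.alloc(arr.length * 8);
arr.forEach(function (v, i) { buf.writeDoubleLE(v, i * 8); });

console.log('JSON   :', JSON.stringify(arr).length);    // ~(digits + 1) chars per number
console.log('binary :', buf.toString('binary').length); // 1 char per raw byte
console.log('base64 :', buf.toString('base64').length); // ~4/3 chars per raw byte
console.log('hex    :', buf.toString('hex').length);    // 2 chars per raw byte
```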

Hopefully someone will find this information useful. I will be switching to JSON for my project; it seems a little weird, but the numbers don't lie.

  1. http://nodejs.org/api/buffer.html#buffer_buffer
  2. https://groups.google.com/forum/#!msg/redis-db/JaI-paZ0xoA/0hVZSTb8iN8J
Max Ehrlich