I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored in a database and should be easily comparable as a unit (e.g. two bunches of data match only if their lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character-encoding/reinterpretation issues.

Base64 was a common method for storing binary data as strings in an era when 7-bit ASCII was the norm; its 33% space penalty was a little annoying, but not horrible. Unfortunately, if one is using UTF-16, the space penalty is 166% (8 bytes to store 3), which seems pretty icky.

Is there any common storage method for storing binary data in a valid Unicode string which will allow better efficiency in UTF-16 (and hopefully not be too horrible in UTF-8)? A base-32768 coding would store 240 bits in 16 characters, which would take 32 bytes of UTF-16 or 48 bytes of UTF-8. By comparison, Base64 coding would use 40 characters, which would take 80 bytes of UTF-16 or 40 bytes of UTF-8. An approach designed to take the same space in UTF-8 or UTF-16 might store 48 bits in three characters that would take eight bytes in either UTF-8 or UTF-16, thus storing 240 bits in 40 bytes of either UTF-8 or UTF-16.
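
For concreteness, here is that arithmetic as a quick throwaway sketch (the bytes-per-character costs are the assumptions: 2 in UTF-16 and 3 in UTF-8 for a high BMP character, 2 in UTF-16 and 1 in UTF-8 for an ASCII character):

var bits = 240;
console.log((bits / 15) * 2, (bits / 15) * 3); // base-32768: 32 bytes of UTF-16, 48 bytes of UTF-8
console.log((bits / 6) * 2, (bits / 6) * 1);   // Base64:     80 bytes of UTF-16, 40 bytes of UTF-8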

Are there any standards for anything like that?

supercat
  • Not all tools seem to like blobs. Admittedly it probably isn't worth bending over backward to construct a data field so someone can cut and paste data into it using SQL Server Explorer, but it can be handy. Perhaps there aren't enough data transport methods which can deal with UTF-8 and UTF-16 but can't handle binary data to make an interchange format worthwhile, but I thought there might be. Certainly storing base64 data in a 16-bit character set feels icky. – supercat Dec 10 '10 at 23:40

1 Answer


Base32768 does exactly what you wanted. Sorry it took five years to exist.

Usage (this is JavaScript, although porting the base32768 module to another programming language is eminently practical):

var base32768 = require("base32768");

var buf = Buffer.from("d41d8cd98f00b204e9800998ecf842", "hex"); // 15 bytes (Buffer.from supersedes the deprecated new Buffer constructor)

var str = base32768.encode(buf); 
console.log(str); // "迎裶垠⢀䳬Ɇ垙鸂", 8 code points

var buf2 = base32768.decode(str);
console.log(buf.equals(buf2)); // true

Base32768 selects 32,768 characters from the Basic Multilingual Plane. Each character takes 2 bytes when represented as UTF-16 or 3 bytes when represented as UTF-8, giving exactly the efficiency characteristics you describe: 240 bits can be stored in 16 characters i.e. 32 bytes of UTF-16 or 48 bytes of UTF-8. (Except for the occasional padding character, analogous to the = padding seen in Base64.)
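
Those byte counts are easy to verify (a quick check of my own, not part of the module's documentation), since Node's Buffer.byteLength measures a string's storage cost in a given encoding:

console.log(Buffer.byteLength(str, "utf16le")); // 16: 8 characters × 2 bytes
console.log(Buffer.byteLength(str, "utf8"));    // 24: 8 characters × 3 bytes
// 15 input bytes = 120 bits = exactly 8 fifteen-bit characters, so no padding here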

This is done by dicing the input bytes (i.e. 8-bit unsigned numbers) into 15-bit unsigned numbers and assigning each resulting 15-bit number to one of the 32,768 characters.
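
Here is a simplified sketch of that repacking step (my own illustration, not the module's actual code; the real implementation also handles a final partial group by emitting padding characters):

function toFifteenBitNumbers(bytes) {
  var numbers = [];
  var acc = 0, bits = 0;
  for (var i = 0; i < bytes.length; i++) {
    acc = ((acc << 8) | bytes[i]) & 0x3FFFFF; // shift in 8 new bits; the mask keeps the arithmetic within 22 bits
    bits += 8;
    while (bits >= 15) {
      bits -= 15;
      numbers.push((acc >>> bits) & 0x7FFF); // peel off the top 15 pending bits
    }
  }
  // any leftover bits here are what the padding characters cover
  return numbers;
}

console.log(toFifteenBitNumbers(buf).length); // 8, matching the 8 code points above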

Note that the characters chosen are also "safe" - no whitespace, control characters, combining diacritics or susceptibility to normalization corruption.
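
That last claim can be spot-checked (again my own check, on the assumption that the repertoire avoids every decomposable character): an encoded string should survive Unicode normalization unchanged.

var encoded = base32768.encode(Buffer.from([0x00, 0x7f, 0xff]));
["NFC", "NFD", "NFKC", "NFKD"].forEach(function (form) {
  console.log(form, encoded.normalize(form) === encoded); // expect true for each form
});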

qntm
  • Interesting. Since the code gives the block starts as characters rather than code points I can't tell by looking at it, but I was wondering whether every character in the encoding will take exactly three bytes in UTF-8. From a storage-efficiency standpoint, a variable-length coding could have advantages, but a fixed-format coding would seem more efficient to work with. Further, if variable-length coding were desired, it would seem better to use the one- and two-byte UTF-8 codes for purposes other than as "bulk data" holders (e.g. as markers for repeated sections of data). – supercat Apr 18 '16 at 22:23
  • @supercat Yes, all of the chosen characters come from the Basic Multilingual Plane, so 3 bytes of UTF-8 or 2 bytes of UTF-16. – qntm Apr 19 '16 at 11:18
  • @qntm: I'd thought the BMP was inclusive of all code points below 65536; if it isn't, what would be the inclusive term for such code points? Any idea if this coding is used anywhere? Personally, I think using two bytes for ASCII characters was a silly idea (even foreign-text documents contain a huge amount of ASCII content like HTML/XML tags, etc.) but adding that to Base64 overhead is even more hideous. Since posting the above I've also wondered whether it would make sense to have an encoding that uses one code point in a 256-code range below 2048, such that... – supercat Apr 19 '16 at 13:57
  • ...each encoded byte would be two bytes of text for both UTF8 and UTF16 coding, and the other byte would be constant for all 256 values. Would any range of 256 code points below 2048 be suitable for that purpose? – supercat Apr 19 '16 at 13:59
  • @qntm Better, but you might want to read [How can I link to an external resource in a community-friendly way](http://meta.stackexchange.com/questions/94022). You've got a description of _what it is_, and _what it does_; what's missing is _how to use it to solve the specific problem_. – Mogsdad Apr 19 '16 at 14:35
  • @supercat You're correct, "Basic Multilingual Plane" is all code points 0 to 65535 inclusive. Sorry if anything I said implied otherwise. – qntm Apr 19 '16 at 15:09
  • @supercat It's true, UTF-16 is not terribly efficient for ASCII-heavy texts, and it has other flaws. I understand it's better than UTF-8 for Chinese/Japanese/Korean text, however. – qntm Apr 19 '16 at 15:10
  • @supercat I'm not aware of anywhere that Base32768 is already in use. However, I do believe that it is fit for use for your purpose. – qntm Apr 19 '16 at 15:12
  • @supercat Your suggested alternate encoding, call it Base256, would have 50% efficiency in UTF-8 (Base64 has 75%), and 50% efficiency in UTF-16 (Base32768 has 94%). In UTF-8 it's optimal to make maximum use of characters with 1-byte encodings (i.e. ASCII). In UTF-16 it's optimal to make maximum use of characters with 2-byte encodings (i.e. the entire BMP). Good thought though. Sorry for the quad-post, I wanted to respond point by point. – qntm Apr 19 '16 at 15:20
  • @qntm: A base256 format would allow concatenation of encoded strings regardless of length, and its 50% efficiency in UTF16 would still be a step up from base64. UTF-16 was intended to be more efficient than UTF-8 for text which uses many code points in the 2048-65535 range, but many kinds of documents, regardless of language, contain so much ASCII markup which is intended for machine processing that UTF-16 doesn't offer much advantage even there. – supercat Apr 19 '16 at 15:53
  • @supercat I wasn't able to find a block of 256 usable characters in the [128, 2048) range, nor any blocks of 128 and only 2 blocks of 64. The best I can suggest is 8 blocks of 32, starting at code points 384, 576, 608, 640, 1184, 1280, 1664 and 1888. Note that all these CPs are divisible by 32, which means the final 5 bits of the input byte will be the same as the final 5 bits of the code point. – qntm Apr 19 '16 at 18:13
  • @qntm: Well, thanks for trying. Have you looked for mappings which would leave the bottom 8 bits alone [but allowing each byte value to choose independently from among the seven or eight choices of MSB that would yield code points 0-2047]? Encoding would require use of a lookup table, but decoding could then simply drop the upper byte. – supercat Apr 19 '16 at 18:30
  • @supercat Neat idea. That worked. Here's your lookup table: `"ԀԁЂԃЄЅІԇЈЉЊЋԌԍԎЏĐđВГДЕЖЗИԙКЛȜȝОПȠȡȢȣȤȥĦħШЩЪЫЬЭЮЯаıвгȴȵȶȷĸȹȺȻȼȽȾȿɀŁłɃɄɅɆɇɈɉŊŋɌɍɎɏɐɑŒœɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥŦŧɨɩɪɫɬɭɮɯɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾɿƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟʠʡ¢£¤¥¦Ƨƨ©ƪƫ¬ƭ®ʯ°±ƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃ˄˅ÆˇˈˉˊˋˌˍˎˏÐˑ˒˓˔˕˖רϙϚϛϜǝÞßϠϡϢϣǤǥæ˧˨˩˪˫ˬ˭ˮ˯ð˱˲˳˴˵Ƕ÷ø˹˺˻˼˽þ˿"` 256 characters long. Each character has a code point in [128, 2048) and the final 8 bits of the code point equal the position in the string. (A runnable sketch using this table appears after these comments.) – qntm Apr 19 '16 at 20:01
  • @qntm: Code point 0x0519 shows up as a box on my browser, but it's described as a Cyrillic small yae. Sorta icky that it doesn't show up, but if there are no normalization issues that may be better than any of the other codes whose low byte is 0x19? – supercat Apr 19 '16 at 21:10
  • @supercat Sorry, you never specified "shows up in most browsers" as a constraint. That's a very difficult constraint to work with because it's based on what fonts you have installed locally. You're not relying on being able to read the data visually, are you? In my experience, as long as the box shows up, the text data will be preserved... – qntm Apr 19 '16 at 21:17
  • @qntm: Having things show up would be nice if practical, but it would be of secondary importance to other issues. It seems that other 0x0_19 characters are either combining marks, combined marks, or show up right to left, so 0x0519 may be the best choice. Are all the other marks in LTR scripts? – supercat Apr 19 '16 at 21:30
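
To make the scheme sketched in those comments concrete, here is a rough implementation (a hypothetical "Base256", not a published standard; TABLE is the 256-character string from the comment above, in which the character at position b has a code point in [128, 2048) whose low 8 bits equal b):

var TABLE = "ԀԁЂԃЄЅІԇЈЉЊЋԌԍԎЏĐđВГДЕЖЗИԙКЛȜȝОПȠȡȢȣȤȥĦħШЩЪЫЬЭЮЯаıвгȴȵȶȷĸȹȺȻȼȽȾȿɀŁłɃɄɅɆɇɈɉŊŋɌɍɎɏɐɑŒœɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥŦŧɨɩɪɫɬɭɮɯɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾɿƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟʠʡ¢£¤¥¦Ƨƨ©ƪƫ¬ƭ®ʯ°±ƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃ˄˅ÆˇˈˉˊˋˌˍˎˏÐˑ˒˓˔˕˖רϙϚϛϜǝÞßϠϡϢϣǤǥæ˧˨˩˪˫ˬ˭ˮ˯ð˱˲˳˴˵Ƕ÷ø˹˺˻˼˽þ˿";

function base256encode(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += TABLE[bytes[i]]; // every entry is a single code unit costing 2 bytes in both UTF-8 and UTF-16
  }
  return out;
}

function base256decode(str) {
  var bytes = Buffer.alloc(str.length);
  for (var i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i) & 0xFF; // drop the high byte; the low 8 bits are the original byte
  }
  return bytes;
}

var buf3 = Buffer.from([0, 25, 128, 255]);
console.log(base256decode(base256encode(buf3)).equals(buf3)); // true

As noted in the comments, this trades efficiency (50% in both UTF-8 and UTF-16) for simplicity: one byte per character, so encoded strings can be concatenated or sliced at any point.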