I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding: because their radices are powers of two, every encoded index is reachable from the input bits and none of the code space is wasted. While base58, Ascii85, base91, base92 et al. do provide some efficiency improvements over the ubiquitous base64 by using more characters, they suffer mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from power-of-2 input chunks and are thus completely wasted. (Base91 only has a similar loss of 89 indices, as found by the script in the link above, but it's patented.)
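A minimal sanity check on those figures, assuming each character pair carries 13 bits of payload (basE91 occasionally packs 14, but the 13-bit case is what produces the quoted numbers):

```js
// basE91/base92-style codecs consume 13 input bits per character pair,
// so any pair index >= 2**13 can never be produced by the encoder.
for (const radix of [91, 92]) {
  const pairs = radix ** 2;   // distinct two-character combinations
  const reachable = 2 ** 13;  // 8192 values actually emitted
  console.log(`base${radix}: ${pairs - reachable} wasted pair indices`);
}
// base91: 89 wasted pair indices
// base92: 272 wasted pair indices
```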
It would be great if it were viable to use base128 in modern-day real-world scenarios.
There are 92 characters available within 0x21 (33) to 0x7E (126), excluding \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
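Enumerating that alphabet is straightforward; a minimal sketch (the only assumption being that \ and " are the two printables a JSON string must escape):

```js
// Collect every printable ASCII character except the two JSON must escape.
let alphabet = "";
for (let code = 0x21; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (ch !== "\\" && ch !== '"') alphabet += ch;
}
console.log(alphabet.length); // 92
```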
Here are a few ways I envisage the remaining 36 characters could be found. This is the question I'm asking.
Just dumbly use Unicode
Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this were worse than base64 on the wire. It would only be useful in scenarios that count Unicode characters rather than bytes, such as tweet length. Not exactly what I'm going for.
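A back-of-envelope calculation bears out that hunch (a sketch assuming uniformly distributed digit values, with each of the 36 extra characters costing a two-byte UTF-8 sequence):

```js
// 92 digits cost 8 wire bits, 36 cost 16; every digit carries 7 payload bits.
const avgWireBits = (92 * 8 + 36 * 16) / 128;                 // 10.25 bits per digit
console.log((7 / avgWireBits * 100).toFixed(1) + "%");        // ~68.3% efficiency
console.log((6 / 8 * 100).toFixed(1) + "% for plain base64"); // 75.0%
```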
Select 36 non-Unicode characters from within the upper (>128) ASCII range
JavaScript was built with the expectation that character-encoding configuration will occasionally go horribly wrong, so the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. Why not just use the upper ASCII range, then? It's there to be used, right?
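To make the idea concrete, here's a minimal encoder sketch; the 36 high bytes chosen (0x80 through 0xA3) are arbitrary placeholders rather than a vetted alphabet, and encodeBase128 is just a hypothetical name:

```js
// Hypothetical 128-digit alphabet: 92 JSON-safe printables plus 36 high bytes.
const DIGITS = [];
for (let c = 0x21; c <= 0x7e; c++) {
  if (c !== 0x22 && c !== 0x5c) DIGITS.push(c); // skip " and \
}
for (let c = 0x80; DIGITS.length < 128; c++) DIGITS.push(c); // 0x80..0xA3

// Pack the input 7 bits at a time, emitting one digit per 7-bit group.
function encodeBase128(bytes) {
  let out = "", acc = 0, nbits = 0;
  for (const b of bytes) {
    acc = ((acc << 8) | b) & 0x3fff; // at most 14 meaningful bits in flight
    nbits += 8;
    while (nbits >= 7) {
      nbits -= 7;
      out += String.fromCharCode(DIGITS[(acc >>> nbits) & 0x7f]);
    }
  }
  // Zero-pad the final partial group, if any.
  if (nbits > 0) out += String.fromCharCode(DIGITS[(acc << (7 - nbits)) & 0x7f]);
  return out;
}

console.log(encodeBase128(new TextEncoder().encode("hi"))); // 2 bytes -> 3 digits
```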
One very real problem could be data going over HTTP and falling through one or more proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially still does today.
Kind of use UTF-8 in interesting ways
UTF-8 defines 1- to 4-byte-long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx, and there are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
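A quick check of that count (a sketch; the bitmask is simply the standard test for the 10xxxxxx form):

```js
// A byte matches 10xxxxxx exactly when its top two bits are "10".
const isContinuation = (b) => (b & 0xc0) === 0x80;
const candidates = [...Array(256).keys()].filter(isContinuation);
console.log(candidates.length); // 64 (0x80 through 0xBF); any 36 complete the set
```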
Determine 36 magic bytes that will work for various esoteric reasons
Maybe there are some high-ASCII characters that will successfully traverse >99% of Internet infrastructure, for various historical or implementation reasons. What characters might these be?
Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) safe in almost every environment, including mainframes using non-ASCII encodings.
EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily used pipe within JavaScript for handling the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.
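For scale, the overhead of any padding-free power-of-two encoding falls out of simple arithmetic (a sketch assuming every digit costs exactly one byte on the wire, which is the ideal case this question is after):

```js
// 8 wire bits per output character, log2(radix) payload bits each.
for (const bits of [6, 7]) {
  const overhead = (8 / bits - 1) * 100; // size increase vs. raw binary
  console.log(`base${2 ** bits}: +${overhead.toFixed(1)}%`);
}
// base64:  +33.3%
// base128: +14.3%
```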
There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must use exactly the first 128 characters of ASCII, and the only answer that acknowledges the encoded alphabet can use any characters goes on to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).
This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.