
This is related to the following question:

Why is base128 not used?

If we want to represent binary data as printable characters, we can hex-encode it using a set of 16 printable 'digits' from the ASCII set (yielding 2 digits per byte of data), or we can base64-encode it using a set of 64 printable ASCII characters (yielding roughly 1.33 characters per byte of data).
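For illustration, here is a quick comparison of the two encodings using Python's standard library (the example bytes are arbitrary):

```python
# Hex vs. base64 on the same 4 arbitrary bytes.
import base64

data = b"\x00\xff\x10\x20"
print(data.hex())              # 00ff1020   -> 8 characters, 2 per byte
print(base64.b64encode(data))  # b'AP8QIA==' -> ~1.33 characters per byte, plus padding
```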

There is no base128 encoding using ASCII characters because ASCII contains only 95 printable characters (there is Ascii85, though, which uses 85 characters: https://en.wikipedia.org/wiki/Ascii85).

What I wonder is whether there is any standardized representation that uses a selection of 256 printable Unicode characters representable in UTF-8, effectively yielding an encoding with 1 printable character per byte of data?

matthias_buehlmann

2 Answers


There is no such standard encoding. But one can easily be created: choose 256 Unicode characters and use them to encode the byte values 0 to 255.

Some of the characters will require 2 or more bytes when encoded in UTF-8, as only 94 printable, non-space characters have a 1-byte encoding.

The most compact encoding you can achieve with this approach is to take these 94 characters (U+0021 to U+007E) and add 162 printable characters that require 2 bytes each, e.g. U+00A1 to U+0142. The result requires about 1.63 output bytes per input byte on average ((94 × 1 + 162 × 2) / 256 ≈ 1.63), so it's less efficient than Base64. That's probably the reason it hasn't been standardized.
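A minimal sketch of such an encoding, using the alphabet suggested above (the function names are just for illustration):

```python
# Sketch of a base-256 text encoding: 94 one-byte UTF-8 characters
# (U+0021..U+007E) plus 162 two-byte characters (U+00A1..U+0142).
ALPHABET = [chr(c) for c in range(0x21, 0x7F)] + [chr(c) for c in range(0xA1, 0x143)]
assert len(ALPHABET) == 256
REVERSE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(data: bytes) -> str:
    # One output character per input byte.
    return "".join(ALPHABET[b] for b in data)

def decode(text: str) -> bytes:
    return bytes(REVERSE[ch] for ch in text)

payload = bytes(range(256))
assert decode(encode(payload)) == payload
# UTF-8 cost: (94 * 1 + 162 * 2) / 256 bytes per input byte.
print(len(encode(payload).encode("utf-8")) / len(payload))  # 1.6328125
```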

Codo
  • Well, data-size efficiency isn't really the main argument for base64, since the result is still larger than the source data. The main argument for base64 is that it can be copy-pasted easily in readable form through e-mail and other text-based interfaces. The same would be true for such a UTF-8 encoding (given that 95% of the internet is UTF-8 encoded) – matthias_buehlmann Feb 02 '21 at 15:56
  • If visual efficiency is your main concern (not quite obvious from your question), then such an encoding would indeed make sense. But then the term "UTF-8" is misleading. It would just be about a suitable binary-to-Unicode encoding. How the Unicode characters are internally transmitted or stored is not relevant in this case. – Codo Feb 02 '21 at 16:23

Because it is not useful.

A code point in the range 0 to 0x7FF carries at most 11 bits of payload, yet needs up to 2 bytes in UTF-8.

But Base64 packs 12 bits into the same 2 bytes, and it is much simpler.

For 16 bits you need code points up to 0xFFFF, which take 3 bytes. Base64 can encode 18 bits in 3 bytes.
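These byte counts are easy to verify with Python's standard library:

```python
# Checking the UTF-8 and Base64 byte counts claimed above.
import base64

print(len(chr(0x7FF).encode("utf-8")))   # 2 bytes for an 11-bit code point
print(len(chr(0xFFFF).encode("utf-8")))  # 3 bytes for a 16-bit code point

# Base64 turns every 3 input bytes (24 bits) into 4 output bytes: 6 bits per byte.
print(len(base64.b64encode(b"abc")))     # 4
```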

So: more complex and less efficient.

It would also be more difficult to get right. Valid Unicode text is subject to restrictions: where combining characters may appear, how many of them may be stacked, and which code points may be used at all (some are reserved for internal use only, others must never appear).
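For example, the surrogate range U+D800 to U+DFFF is one block of code points that can never appear in well-formed UTF-8, as this Python snippet demonstrates:

```python
# Surrogate code points (U+D800..U+DFFF) are invalid in UTF-8
# and are rejected by the encoder.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError as err:
    print(err)  # "'utf-8' codec can't encode character '\ud800' ..."
```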

Giacomo Catenazzi
  • of course it's useful – because it's VISUALLY shorter (the binary representation of base64-encoded data is also larger than the source data). The benefit of such representations is not storage efficiency but a shorter, printable representation. And obviously some code points have gotchas, but such an encoding would not use those code points. – matthias_buehlmann Feb 02 '21 at 15:53
  • Visually... you hit on a difficult point. Latin capital letter A looks like Greek capital Alpha and Cyrillic capital letter A. Some characters require more horizontal space, some also more vertical space. If you find a use for it, you can easily create such an encoding. Maybe use just the Latin alphabet (lower case and upper case), and add 3 or 4 accents (possibly more) to any character to gain additional bits of information. Possibly with just an i and a j (very small) you could store 6 bits of information (or more) with diacritics; add a few other vowels and you have a compact representation. – Giacomo Catenazzi Feb 02 '21 at 16:06