89

Why is only base64, rather than base128, used to transmit binary data on the web? The ASCII character set has 128 characters, which in theory could represent base 128, yet in most cases only base64, not base128, is used.

SOFe
  • 7,867
  • 4
  • 33
  • 61
gmadar
  • 1,884
  • 3
  • 19
  • 22
  • 60
    Why not even base 256? – Gumbo May 15 '11 at 11:20
  • 22
    I think the point is to have *printable* characters (although there are also more than 64...) – Felix Kling May 15 '11 at 11:20
  • 29
    I think base 128 got belonged to us a while ago. The team assigned to guard base 64 is still holding out. – Ritch Melton May 15 '11 at 11:23
  • 5
    Why is this question JavaScript-specific? This holds true for most other languages used on the web as well, doesn't it? – Benedikt Waldvogel May 15 '11 at 22:08
  • 2
    Thinking about this same thing today, I just came across this question. I have to disagree with the accepted answer. The "printability" of a character has absolutely no bearing on its ability to be reliably transmitted as a string over the wire, especially in the case where both ends assume a UTF-8 encoding of the string. In fact, since the lowest invalid codepoint in UTF-8 is D800, it would be possible to encode 15-bit values reliably as UTF-8 codepoints for transmission as strings. It seems like a good idea to me if efficiency is a serious concern and human readability is not. – kqnr May 28 '11 at 01:15
  • As an addendum, I think base-2048 would actually be the best compromise if going the UTF-8 route. All encoded values would fit within two UTF-8 bytes, and decoding and encoding is slightly simplified since all values are aligned to 4 bits. FWIW, I think this has real value in transmitting binary data over websockets as efficiently as possible, at least until the binary protocol is standardized and widely implemented. – kqnr May 28 '11 at 01:34
  • 5
    @KenRockot: I see you recognize that some of your 15-bit chars would get encoded into 3 bytes. Your base-2048 encoding means packing 11 bits into 2 bytes, which makes 5.5 bits per byte - half a bit less than base-64. – maaartinus Jan 28 '14 at 15:05
  • 1
    [base58 is used in Bitcoin](https://github.com/bitcoin/libbase58). – Geremia Mar 13 '18 at 16:54
  • This question is [being discussed on meta](https://meta.stackoverflow.com/q/387159/6296561). – Zoe Jul 13 '19 at 10:45

8 Answers

104

The problem is that at least 32 characters of the ASCII character set are 'control characters', which may be interpreted by the receiving terminal. For example, there's the BEL (bell) character that makes the receiving terminal chime, and the SOH (Start of Heading) and EOT (End of Transmission) characters, which do exactly what their names imply. And don't forget CR and LF, which may have special meanings in how data structures are serialized/flattened into a stream.
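
A quick illustration (Python used here purely for convenience): feeding bytes that contain exactly these control characters through base64 yields printable ASCII only, and the data round-trips unchanged.

```python
import base64

# Raw bytes containing the control characters mentioned above:
# BEL (0x07), EOT (0x04), CR (0x0D), LF (0x0A), plus a high byte and a NUL.
raw = bytes([0x07, 0x04, 0x0D, 0x0A, 0xFF, 0x00])

encoded = base64.b64encode(raw)
print(encoded)                            # b'BwQNCv8A' -- printable ASCII only
print(base64.b64decode(encoded) == raw)   # True -- the data round-trips exactly
```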

Adobe created the Base85 encoding to use more characters in the ASCII character set, but AFAIK it's protected by patents.

Janus Troelsen
  • 20,267
  • 14
  • 135
  • 196
pepoluan
  • 6,132
  • 4
  • 46
  • 76
  • 7
    Base91 seems like a good open source option: http://base91.sourceforge.net/ – Jorge Cevallos Oct 09 '13 at 12:06
  • 2
    It's worth considering that a power of 2 fits byte data more readily, and encoding is simpler. Then there's portability; every language has a base64 encode and/or a base64 decode. – Lodewijk Jul 28 '14 at 12:15
  • 5
    Re *Base85 and Adobe*: the answer could be made more useful if it cited the patent numbers and year granted. If the patents are a problem there's always [`btoa`](https://en.wikipedia.org/wiki/Ascii85#btoa_version), which dates from 1990, is unencumbered by patents, and those would certainly be expired anyway. – agc Mar 08 '17 at 14:22
65

Because some of those 128 characters are unprintable (mainly those below codepoint 0x20). Therefore, they can't reliably be transmitted as a string over the wire. And if you go above codepoint 127, you run into encoding issues because different systems use different encodings.
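
A small sketch of the second failure mode (the byte values and codepages are arbitrary examples): bytes above 127 may not form valid UTF-8 at all, and even when a single-byte encoding accepts them, the characters you get back depend on which codepage the receiver assumes.

```python
raw = bytes([0x93, 0xA4])        # two bytes above 127, arbitrary example values

# Not a valid UTF-8 sequence, so it cannot travel unchanged as a UTF-8 string:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)

# Single-byte encodings do decode it, but to different characters per codepage:
print(repr(raw.decode('latin-1')))   # '\x93¤'  (0x93 is a C1 control character here)
print(repr(raw.decode('cp1252')))    # '“¤'
```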

driis
  • 161,458
  • 45
  • 265
  • 341
  • 8
    Base94 exists on GitHub; it uses all 94 printable ASCII characters: https://gist.github.com/iso2022jp/4054241 – intrepidis Jul 05 '15 at 11:07
15

As already stated in the other answers, the key point is to reduce the character set to the printable ones. A more efficient encoding scheme is basE91 because it uses a larger character set and still avoids control/whitespace characters in the low ASCII range. The webpage contains a nice comparison of binary vs. base64 vs. basE91 encoding efficiency.
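
As a rough back-of-the-envelope comparison (not the table from the basE91 page): each character of a base-N encoding carries log2(N) bits, so the idealized overhead per byte works out as below. Practical schemes such as base64's 3-byte/4-character blocks or basE91's 13/14-bit groups sit at or just above these bounds.

```python
import math

# Idealized cost of a base-N text encoding: one byte needs 8 / log2(N) characters.
for n in (64, 85, 91, 128):
    chars_per_byte = 8 / math.log2(n)
    print(f"base{n}: {chars_per_byte:.3f} chars/byte, {chars_per_byte - 1:.1%} overhead")
# base64: 1.333 chars/byte, 33.3% overhead
# base85: 1.248 chars/byte, 24.8% overhead
# base91: 1.229 chars/byte, 22.9% overhead
# base128: 1.143 chars/byte, 14.3% overhead
```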

I once cleaned up the Java implementation. If people are interested I could push it on GitHub.

Update: It's now on GitHub.

Benedikt Waldvogel
  • 12,406
  • 8
  • 49
  • 61
12

That the first 32 characters are control characters has absolutely no relevance, because you don't have to use them to get 128 characters. We have 256 characters to choose from, and only the first 32 are control characters. That leaves 224 characters, so 128 is completely possible without using control characters.

Here is the reason: it has to be something that will look the same, and that you can copy and paste, no matter where. Therefore it has to be characters that will be displayed the same on any forum, chat, email and so on. That means we can't use characters that forum/chat/email clients may typically use for formatting, or may disregard. It also has to be characters that are the same regardless of font, language and regional settings.

That is the reason!

Adrian
  • 42,911
  • 6
  • 107
  • 99
user3119289
  • 153
  • 1
  • 2
  • 7
    The control characters are relevant because pretty much everyone was already assuming your point that it should be as codepage/encoding neutral as possible. That necessarily restricts you to (7-bit) ASCII, which is a subset of most of the relevant encodings. Also, not all of the internet is 8-bit clean, and much of it is de facto ASCII. Your point is worth making though. – Tim Seguine Nov 09 '14 at 13:04
  • 7
    Just to add: ASCII defines only 128 characters. Characters #128 to #255 are _not_ defined in ASCII. Since the question explicitly references ASCII and not "any 8-bit encoding", all answers limit themselves to the 128 characters of the ASCII set. – pepoluan May 12 '16 at 05:50
  • Using the most common UTF-8 encoding as an example: bytes 128 to 191 would immediately result in UTF-8 decoding errors, and bytes 192 to 255 would imply that the following byte belongs to the same character, but if that next byte is below 128 it would again result in UTF-8 decoding errors. However, almost all character-encoding-sensitive languages have the base64 library treat base64 strings as UTF-8-safe strings. The same cannot be done with base128, since its output can't be represented as a UTF-8-safe string. – SOFe Jul 13 '19 at 05:32
10

Base64 is common because it solves a variety of issues (it works nearly everywhere you can think of):

  • You don't need to worry whether the transport is 8-bit clean or not.

  • All the characters in the encoding are printable. You can see them. You can copy and paste them. You can use them in URLs (particular variants). etc.

  • Fixed encoding size. You know that m bytes can always encode to n bytes.

  • Everyone has heard of it - it's widely supported, lots of libraries, so easy to interoperate with.

Base128 doesn't have all those advantages.

It looks like it would sidestep the 8-bit-clean issue - but recall that base64 actually uses 65 symbols (the 65th, '=', marks padding). Without an out-of-band character you can't have the benefits of a fixed encoding size. And if base128 used an out-of-band character, it would have to come from outside the 7-bit range, so the output wouldn't be safe on non-8-bit-clean transports after all.
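
To make the 65th-symbol point concrete, here is a small sketch using Python's standard base64 module: the '=' padding character is what lets every 3-byte block map to exactly 4 output characters, so the output length is always predictable.

```python
import base64

# Every 3 input bytes become exactly 4 output characters; '=' pads the last block.
for m in range(1, 7):
    out = base64.b64encode(b'\x00' * m)
    print(m, len(out), out)   # len(out) == 4 * ceil(m / 3)
# 1 4 b'AA=='
# 2 4 b'AAA='
# 3 4 b'AAAA'
# 4 8 b'AAAAAA=='
# ...
```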

It's not all negative though.

  • base128 is easier to encode/decode than base64 - you just use shifts and masks (see the sketch after this list). Can be important for embedded implementations

  • base128 makes slightly more efficient use of the transport than base64 by using more of the available bits.
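
For illustration, here is a minimal sketch of such a shift-and-mask base128 codec. The 128-character ALPHABET is purely hypothetical (94 printable ASCII characters padded with 34 printable Latin-1 characters); agreeing on a safe alphabet is exactly the hard part discussed elsewhere on this page.

```python
# Hypothetical 128-symbol alphabet, for illustration only.
ALPHABET = [chr(c) for c in range(0x21, 0x7F)] + [chr(c) for c in range(0xA1, 0xC3)]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
assert len(ALPHABET) == 128

def base128_encode(data: bytes) -> str:
    out, acc, nbits = [], 0, 0
    for byte in data:
        acc = (acc << 8) | byte          # shift each byte into the accumulator
        nbits += 8
        while nbits >= 7:                # peel off 7-bit groups with a mask
            nbits -= 7
            out.append(ALPHABET[(acc >> nbits) & 0x7F])
            acc &= (1 << nbits) - 1
    if nbits:                            # flush remaining bits, zero-padded
        out.append(ALPHABET[(acc << (7 - nbits)) & 0x7F])
    return ''.join(out)

def base128_decode(text: str) -> bytes:
    out, acc, nbits = bytearray(), 0, 0
    for ch in text:
        acc = (acc << 7) | INDEX[ch]     # shift each 7-bit group back in
        nbits += 7
        if nbits >= 8:                   # emit full bytes as they become available
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)

data = b"any binary \x00\xff payload"
assert base128_decode(base128_encode(data)) == data
```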

People do use base128 - I'm using it for something now. It's just not as common.

John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • Also remember that mail/news systems and their ilk (and also XML) aren't always kind to the first 32 codepoints (consider CR LF vs LF, for example), but otherwise your answer looks very good. – SamB Jan 25 '15 at 01:21
  • "that base64 uses 65 symbols." => typo or did I miss something? – Kikiwa Nov 22 '16 at 13:39
  • @Kikiwa, look at this [java sample on wikipedia](https://en.wikipedia.org/wiki/Base64#Sample_Implementation_in_Java). Check the length of the `CODES` variable. – John La Rooy Nov 22 '16 at 21:50
  • Oh yes, the padding character '=' appears only at the end of the encoded payload - you're right, thanks. – Kikiwa Nov 23 '16 at 08:34
4

Not sure, but I think the lower values (representing control codes and the like) are not reliably transferred as text/characters inside HTTP requests/responses, and the values above 127 may be locale/codepage-specific, so there are not 128 different characters that can be expected to work across all browsers/platforms.

esaj
  • 15,875
  • 5
  • 38
  • 52
3

esaj is right. Base64 is used to encode binary data for transmission using a protocol that expects only text. It's right there in the Wikipedia entry.

Russell Troywest
  • 8,635
  • 3
  • 35
  • 40
2

Check out the base128 PHP class. It encodes and decodes using the ISO 8859-1 charset.

GoogleCode PHP-Class Base128

Nisse Engström
  • 4,738
  • 23
  • 27
  • 42
seizu
  • 477
  • 4
  • 9
  • 1
    I wish it used UTF-8 instead... – Janus Troelsen Sep 20 '12 at 13:12
  • 1
    Base encoding has nothing to do with the underlying data. You can use any text encoding you desire to encode your text/data. What he means is the Base## index table uses the ISO 8859-1 ASCII charset as the translation. – Chad May 21 '14 at 05:55
  • 1
    It does have something to do with the underlying data as soon as you try to *embed* base-encoded binary data in text. If that text is encoded in another encoding, you will have problems. – Stijn de Witt Jan 04 '17 at 01:45
  • There's no such thing as an "ISO 8859-1 ASCII" character set. The program encodes data using 128 different printable ISO 8859-1 characters. **It does not use ASCII**, in any way, shape or form. – Nisse Engström May 09 '17 at 20:39