
I have seen many resources about the usage of Base64 on today's internet. As I understand it, all of those resources spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:

  1. Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something that they receive? Are there any examples of such systems that may corrupt/wrongly interpret data they receive on today's internet?
  2. Why do we fear only binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by the same logic, any text characters that do not belong to the Base64 alphabet could be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea: when we use a browser, are JavaScript and HTML files transferred in Base64 form?
driewguy

2 Answers


There are two reasons why Base64 is used:

  1. systems that are not 8-bit clean. This stems from "the before time", when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
  2. systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).

Both of these have a similar effect when transferring binary (i.e. non-text) data over them: they try to interpret the binary data as textual data in some character encoding, which obviously doesn't make sense (binary data has no character encoding), and as a consequence modify the data in an unfixable way.
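As a rough illustration, here is a minimal Python sketch (using arbitrary made-up bytes) of both failure modes: a hop that keeps only 7 bits per byte, and a hop that decodes the bytes as text before re-encoding them.

```python
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00, 0xC3, 0x28])  # arbitrary binary data

# Case 1: a non-8-bit-clean system keeps only the low 7 bits of every byte.
stripped = bytes(b & 0x7F for b in payload)
print(stripped == payload)   # False: every byte >= 0x80 was silently altered

# Case 2: an 8-bit-clean system decodes the bytes as UTF-8 text (replacing what it
# cannot decode) and re-encodes the result -- exactly the "interpretation" step above.
mangled = payload.decode("utf-8", errors="replace").encode("utf-8")
print(mangled == payload)    # False: invalid sequences were replaced and data was lost
```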

Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
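For example, a minimal Python sketch of this property:

```python
import base64
import os

blob = os.urandom(32)                      # any binary data
encoded = base64.b64encode(blob)           # only A-Z, a-z, 0-9, '+', '/' and '=' appear
print(all(b < 0x80 for b in encoded))      # True: the 8th bit is never set
print(base64.b64decode(encoded) == blob)   # True: the data round-trips losslessly
```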

This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
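A small Python sketch of what "ASCII compatible" buys you here (the sample string is arbitrary):

```python
ascii_bytes = b"SGVsbG8sIFdvcmxkIQ=="   # a Base64 string is plain ASCII

# ASCII-compatible encodings all decode these bytes to the same characters ...
print(ascii_bytes.decode("utf-8")
      == ascii_bytes.decode("iso-8859-1")
      == ascii_bytes.decode("cp1252"))     # True

# ... while UTF-16 pairs the bytes up into entirely different characters.
print(ascii_bytes.decode("utf-16-le"))     # nothing like the original text
```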

As to your second question, the answer is two-fold:

  1. textual data often comes with some kind of metadata (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
  2. in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64 (see the sketch after this list).
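As a rough sketch of that second point, Python's standard quopri module shows what quoted-printable does to non-ASCII text (the sample sentence is made up):

```python
import quopri

text = "Héllo, naïve café\n".encode("utf-8")

qp = quopri.encodestring(text)           # readable ASCII with =XX escapes for non-ASCII bytes
print(qp)
print(quopri.decodestring(qp) == text)   # True: fully reversible
```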

Last but not least: Base64 has a serious drawback and that's that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
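You can verify the ~33% overhead directly; a quick Python sketch:

```python
import base64

for n in (3, 300, 1024, 1_000_000):
    encoded_len = len(base64.b64encode(b"\x00" * n))
    print(n, "->", encoded_len)   # 3 -> 4, 300 -> 400, 1024 -> 1368, 1000000 -> 1333336
# Every 3 input bytes become 4 output bytes (plus padding), i.e. roughly a 4/3 size increase.
```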

Joachim Sauer
  • In a typical client-server model, I visit a website, which is essentially nothing but my browser asking for a webpage from the server of the website. The server will send a binary stream which represents the webpage, with an image embedded inside it. For simplicity, let us assume the webpage is only HTML. Now, this binary stream will be cut into network packets and forwarded to my browser via a route of multiple routers. For my first question, where do 8-bit clean/unclean machines fit into this architecture? – driewguy Oct 15 '21 at 08:41
  • @driewguy: in the example you gave, nowhere. HTTP is 8-bit clean and always has been. It's mostly older protocols (SMTP is the primary example) that suffer from this problem. The reason that Base64 is still used on today's web is stuff like embedding images inside HTML (using `data:` URLs with a Base64 payload): HTML is textual data and someone decided to embed an image (binary data) in it. That's an example of case #2. – Joachim Sauer Oct 15 '21 at 08:43
  • You mentioned that textual data is accompanied by metadata which helps decoding at intermediate machines (though I am not sure why intermediate machines would need to decode data that they receive, hence the first question). Then we could apply a similar idea and accompany non-textual data with some metadata which basically says: do not interpret this in any way. Maybe I am understanding it wrongly. Let us take another example. 'A' has an ASCII value of 65. So, if an HTML file is transferred from server to client (which will be in the form of binary packets), why will 'A' not be interpreted wrongly? – driewguy Oct 15 '21 at 08:50
  • We could. But we don't. Because protocols are written with one use in mind and defined a certain way (for example sending short text messages as mail using SMTP). Then someone finds an alternative way to use them (such as sending actual binary files) but doesn't/can't wait for all the standards and their implementations to be updated. So they develop a workaround (such as Base64 encoding to transfer binary data over a system designed to transfer ASCII text only). – Joachim Sauer Oct 15 '21 at 08:52
  • So you mean Base64 is only useful if textual data with embedded binary data (images) is transferred over protocols other than HTTP? But HTTP is ubiquitous as far as websites are concerned. Why then do we need to embed images in Base64 inside HTML? – driewguy Oct 15 '21 at 08:52
  • HTML is specified to contain text in a certain encoding. If you included arbitrary binary data in the middle of HTML, the browser would attempt to decode that binary data using that encoding (most frequently UTF-8 on today's web). But since that binary data *is not UTF-8-encoded text*, that attempt at decoding will mangle the binary data. Therefore, if you want to embed binary data in HTML (which I personally find a bad idea), you need to encode it with something like Base64, which produces text. – Joachim Sauer Oct 15 '21 at 08:54
  • Hmmm. That makes sense. But I would have expected that in the long run the standards would have been updated, particularly since Base64 has such a high cost in terms of data size. – driewguy Oct 15 '21 at 08:54
  • They are. You *can* build a web site without ever using Base64 these days and many people do. Just transfer each binary thing (i.e. mostly images and videos) over separate HTTP requests and don't try to stuff it into the HTML itself. – Joachim Sauer Oct 15 '21 at 08:55
  • That makes sense. So as far as the transfer of HTML files via HTTP is concerned, Base64 encoding avoids misinterpreting binary data on the client side. It is guaranteed that no misinterpretation happens at any intermediate machines. However, there is a chance of misinterpreting binary data at intermediate machines if other protocols are used. Also, if other protocols are used, then there is a chance that textual data containing characters outside of the Base64 alphabet may be misinterpreted too. Am I right in understanding this? – driewguy Oct 15 '21 at 08:58
  • In the huge majority of cases Base64 is **not used** when transferring HTML over HTTP. The only somewhat frequent use of Base64 in this scenario is if someone wants to embed a (usually small) image inside the HTML using a `data:` scheme URI with a Base64 payload (see the sketch after this comment thread). And [as mentioned in this post](https://stackoverflow.com/questions/201479/what-is-base-64-encoding-used-for) Base64 never **guarantees** that stuff doesn't get messed up, it just makes mess-ups a lot less likely. – Joachim Sauer Oct 15 '21 at 09:03
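To make the comments above concrete, here is a minimal Python sketch of how such an inline image ends up inside HTML; the file name is hypothetical, and the scheme is `data:` with a `;base64` marker:

```python
import base64

# Hypothetical small image file; any binary file works the same way.
with open("icon.png", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# The binary image travels inside the (textual) HTML as a data: URI with a Base64 payload.
html = f'<img src="data:image/png;base64,{encoded}" alt="icon">'
print(html[:80], "...")
```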

One of the uses of Base64 is sending email.

Mail servers used to transmit data over terminal-style connections. It was also common to have line-ending translation, e.g. \r\n into a single \n and the reverse. Note: there was also no guarantee that 8-bit data could be used (the email standards are old, and they also allowed non-"internet" email, with ! instead of @ in addresses). Also, systems may not have been fully ASCII.

Also, a line containing just a dot is treated as the end of the message body, and the mbox format treats a line starting with "From " as the start of a new mail (which is why such lines are escaped as ">From"), so even when the 8-bit flag became common in mail servers, the problems were not totally solved.

Base64 was a good way to remove all these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only agreement between sender and receiver (and the right programs), without worrying about the many relay servers in between. Note: any stray \r, \n, etc. inside Base64 data are simply ignored by the decoder.
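A minimal Python sketch of this in practice, using the standard email library (the addresses, subject and file name are made up): attaching binary data makes the library Base64-encode it, so the relays in between only ever see ASCII lines.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.org"        # hypothetical addresses
msg["To"] = "bob@example.org"
msg["Subject"] = "Report"
msg.set_content("See the attached file.")

# Binary attachments get Content-Transfer-Encoding: base64, so dot-stuffing,
# line-ending translation and 7-bit relays cannot damage them.
msg.add_attachment(b"\x89PNG\r\n\x1a\n...binary...", maintype="image",
                   subtype="png", filename="chart.png")

print(msg.as_string()[:400])   # headers plus ASCII Base64 lines for the attachment
```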

Note: you can also use Base64 to encode strings in URLs, without worrying about how web browsers interpret them (there is a URL-safe variant of the alphabet for this). You may also see Base64 in configuration files (e.g. to include icons): a specially crafted image can then never be misinterpreted as configuration syntax. In short, Base64 is handy for encoding binary data in protocols that were not designed for binary data.
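For the URL case specifically, Python exposes the URL-safe variant directly; a quick sketch with arbitrary bytes:

```python
import base64

token = bytes([0xFB, 0xEF, 0xBE, 0x3F, 0x00, 0x7F])   # arbitrary binary token

print(base64.b64encode(token))          # b'++++PwB/'  ('+' and '/' are special inside URLs)
print(base64.urlsafe_b64encode(token))  # b'----PwB_'  (uses '-' and '_' instead)
```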

Giacomo Catenazzi