4

I am trying to ascertain if there is a standard arithmetical formula which, given the length of an unencoded string, will reveal the length of that string when it has been base-64 encoded.

Here is a list of strings and their base-64 encodings:

A : QQ==
AB : QUI=
ABC : QUJD
ABCD : QUJDRA==
ABCDE : QUJDREU=
ABCDEF : QUJDREVG
ABCDEFG : QUJDREVGRw==
ABCDEFGH : QUJDREVGR0g=
ABCDEFGHI : QUJDREVGR0hJ
ABCDEFGHIJ : QUJDREVGR0hJSg==
ABCDEFGHIJK : QUJDREVGR0hJSks=
ABCDEFGHIJKL : QUJDREVGR0hJSktM

Here are the string lengths of the original strings and the lengths of their base-64 encoded strings (not including the = signs sometimes appended to the end of the encoding):

1 : 2
2 : 3
3 : 4
4 : 6
5 : 7
6 : 8
7 : 10
8 : 11
9 : 12
10 : 14
11 : 15
12 : 16

What single formula, when applied to the numbers on the left, results in the numbers on the right?

Cœur
Rounin
  • 4
    Does this answer your question: [Base64: What is the worst possible increase in space usage?](https://stackoverflow.com/a/4715480/421195): `ceil(n / 3) * 4`? – paulsm4 Sep 15 '19 at 15:46
  • 2
    Maybe something like `Math.ceil(str.length * (4 / 3))` – Victor Sep 15 '19 at 15:47
  • 2
    @Rounin: As you can see from the replies, the "answer" to your question is a bit more subtle than it might appear at first glance. Please refer to: 1) [RFC 4648](https://tools.ietf.org/html/rfc4648), which explicitly says "`...In some circumstances, the use of padding ("=") in base-encoded data is not required or used...`". 2) [MDN: Base64 encoding and decoding](https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding), with its discussion of JS `atob()`, `btoa()`, and "The Unicode Problem". – paulsm4 Sep 15 '19 at 20:52

3 Answers

4

Your question is muddled, because of the part where you say "not including the = signs sometimes appended to the end of the encoding".

I'm not saying the length of the non-= portion of a base64 encoding result is uninteresting -- perhaps you have valid reasons for wanting to know that.

But if you are trying to calculate, say, the storage needed for a base64 encoding result, you need to include storage for the = signs; a base64 result cannot be decoded without them. Observe:

$ echo -n 'ABCDE' | base64
QUJDREU=

$ echo -n 'QUJDREU=' | base64 --decode | od -c
0000000    A   B   C   D   E                                            

$ echo -n 'QUJDREU' | base64 --decode | od -c
0000000    A   B   C                                                    

NOTE #1 : Strictly speaking, the = signs do not need to be stored, because the amount of missing padding can be calculated from the length of a given base64 result. They do, however, need to be supplied to the decoding operation, which means a custom decoding step that first checks whether the padding is missing. I wager that storing at worst 2 extra bytes is far less expensive than the hassle / complexity / unexpectedness of a custom base64 decoding function.

NOTE #2 : As per follow-up comments, some libraries have base64 functions that support missing padding. Treatment of padding is implementation-specific. In some contexts, padding is mandatory (per the relevant specs). Each of the following is a reasonable treatment of padding for any specific library:

  1. implicit padding : assume padding characters for inputs whose length is one or two characters short of a multiple of 4 (note: 3 characters short is still invalid, since an unpadded base64 encoding can only be 0, 1, or 2 characters short of a multiple of 4)

  2. best-effort decoding : decode the longest portion of the input that is divisible by 4 bytes

  3. assume truncation : reject as invalid an input whose length is not divisible by 4 bytes, on the assumption that this indicates an incomplete transmission

Again, which of these is most correct will depend upon the context in which the code in question is operating, and different library authors will make different determinations on this.
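To make the three treatments concrete, here is a minimal sketch of a pre-processing step that a decoder might apply. The function name `normalizeBase64Input` and the policy strings are made up for illustration; they are not taken from any particular library, and a real implementation would also validate the alphabet.

```javascript
// Hypothetical helper: the policy names mirror the three treatments listed above.
function normalizeBase64Input(input, policy) {
  const remainder = input.length % 4;
  if (remainder === 0) return input; // already whole 4-character groups

  switch (policy) {
    case 'implicit-padding':
      // A remainder of 1 means "3 characters short", which no byte sequence produces.
      if (remainder === 1) throw new RangeError('3 characters short is never valid');
      return input + '='.repeat(4 - remainder);
    case 'best-effort':
      // Keep only the longest prefix whose length is divisible by 4.
      return input.slice(0, input.length - remainder);
    case 'assume-truncation':
      throw new RangeError('Length is not a multiple of 4; input may be truncated');
    default:
      throw new TypeError('Unknown policy: ' + policy);
  }
}

console.log(normalizeBase64Input('QUJDREU', 'implicit-padding')); // "QUJDREU="
console.log(normalizeBase64Input('QUJDREU', 'best-effort'));      // "QUJD"
```

With `'best-effort'`, the result `QUJD` decodes to `ABC`, which matches the output of the `base64 --decode` run on unpadded input shown earlier in this answer.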

The answer from @Victor is the best answer; it is the most germane to the context of the question (JavaScript), and considers the crucial bytes-vs-characters issue as well.

landru27
  • 1
    Thank you, @landru27. That's interesting. With javascript `window.atob()` function, I am seeing the same results after decoding both `QUJDREU=` and `QUJDREU`. That said, I do now understand that all `base64` encoding lengths are always exactly divisible by 4 - and that the point of the `=` padding is to ensure that this is always the case. – Rounin Sep 15 '19 at 16:44
  • 1
    @Rounin You are right: JavaScript as well as other programming languages don't use padding characters for decoding. For example, [atob specification](https://www.w3.org/TR/html50/webappapis.html#dom-windowbase64-atob) says: _"if input ends with one or two "=" (U+003D) characters, remove them from input"_. By the way, on my machine with `base64 (GNU coreutils) 8.23` the command `echo -n 'QUJDREU' | base64 --decode | od -c` returns the right result (`A B C D E`) even if it throws the `base64: invalid input` warning. – Victor Sep 15 '19 at 16:56
  • 2
    @Victor : agreed, the treatment of padding is an implementation-specific detail; my answer over-generalizes; in some contexts (such as early e-mail RFCs, where the bulk of my exposure to base64 arises from) the padding is mandatory; with a library function that follows the note I added, the "hassle / complexity / unexpectedness" disappears, and there is no reason to worry about the padding – landru27 Sep 15 '19 at 20:21
  • 1
    @Rounin : please see my comment to Victor, and my edited answer; I'm familiar with base64 in general, but have never needed to do so in a Javascript context; Victor's answer is much more accurate, and more precise as well – landru27 Sep 15 '19 at 20:56
4

The function in https://stackoverflow.com/a/57945696/230983 does exactly what Rounin needs. But if you want to support Unicode characters, you cannot rely on the `length` property, because it counts UTF-16 code units rather than bytes, so you need another way to count the number of bytes. A simple way to solve this is to use a `Blob`:

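/**
 * Estimate the number of Base64 characters (excluding "=" padding)
 * required to encode the specified string. A Blob stores strings as
 * UTF-8, so blob.size is a byte count, not a character count.
 *
 * @param {String} str
 * @returns {Number}
 */
function detectB64CharsLength(str) {
  const blob = new Blob([str]);
  return Math.ceil(blob.size * (4 / 3));
}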

/**
 * A dirty hack for encoding Unicode characters to Base64
 * 
 * @link https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/Base64_encoding_and_decoding#The_Unicode_Problem
 * @param {String} data
 * @returns {String}
 */
function utoa(data) {
  return btoa(unescape(encodeURIComponent(data)));
}

// Run some tests and make sure everything is ok
['a', 'ab', 'ββ', ''].forEach(v => {
  console.log(v, detectB64CharsLength(v), utoa(v));
});
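Where `Blob` is not available, a similar byte count can be obtained with `TextEncoder`, which always encodes to UTF-8. The sketch below is only an illustration: the helper names are invented here, and it computes both the unpadded length asked about in the question and the padded length `ceil(n / 3) * 4` mentioned in the comments on the question.

```javascript
// Hypothetical helper names, chosen for this example only.
function utf8ByteLength(str) {
  // TextEncoder produces a Uint8Array of UTF-8 bytes.
  return new TextEncoder().encode(str).length;
}

// Length of the Base64 output without trailing "=" characters.
function base64LengthUnpadded(byteCount) {
  return Math.ceil((byteCount * 4) / 3);
}

// Length of the Base64 output including "=" padding (always a multiple of 4).
function base64LengthPadded(byteCount) {
  return Math.ceil(byteCount / 3) * 4;
}

for (const s of ['a', 'ab', 'ββ', '']) {
  const n = utf8ByteLength(s);
  console.log(JSON.stringify(s), n, base64LengthUnpadded(n), base64LengthPadded(n));
}
// "a" 1 2 4
// "ab" 2 3 4
// "ββ" 4 6 8
// "" 0 0 0
```

Multiplying before dividing keeps the arithmetic on integers until the final division, which avoids relying on how `4 / 3` rounds in floating point.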
Victor
  • 2
    @Rounin I am very glad that we found a solution that solves your problem. By the way, `Blob` is not supported by MSIE <= 9, so be careful if you are planning to explore the Jurassic World :) – Victor Sep 15 '19 at 21:26
0

As I was finishing typing out the question above, I realised (I think) what the formula is.

  1. Divide the original string length by 3.
  2. Round that number up.
  3. Add the rounded-up number to the original string length.

Like this:

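const getLengthOfStringAfterBase64Encoding = (string) => {

  // Number of characters in the input (assumes one byte per character)
  const stringLength = string.length;

  // Encoded length, excluding any trailing = padding
  const base64EncodedStringLength = stringLength + Math.ceil(stringLength / 3);

  return base64EncodedStringLength;

};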
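As a quick sanity check, the sketch below applies the same arithmetic to the lengths tabulated in the question. It takes a byte count rather than a string, and the name `unpaddedLength` is introduced here only for the check. It assumes one byte per character, so it says nothing about non-ASCII input.

```javascript
// Same arithmetic as above, applied to a count instead of a string.
const unpaddedLength = (n) => n + Math.ceil(n / 3);

// Pairs copied from the question: [input length, encoded length without "="].
const table = [[1, 2], [2, 3], [3, 4], [4, 6], [5, 7], [6, 8],
               [7, 10], [8, 11], [9, 12], [10, 14], [11, 15], [12, 16]];

const allMatch = table.every(([n, expected]) => unpaddedLength(n) === expected);
console.log(allMatch); // true

// For integer n, n + ceil(n / 3) is the same number as ceil(4n / 3).
let sameAsCeil = true;
for (let n = 0; n <= 1000; n++) {
  if (unpaddedLength(n) !== Math.ceil((4 * n) / 3)) sameAsCeil = false;
}
console.log(sameAsCeil); // true
```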
Rounin
  • 1
    This won't work. With `let str = 'https://stackoverflow.com/questions/57945655/given-the-the-length-of-an-unencoded-string-what-single-formula-reveals-the-len#57945696'`, `btoa(str).length == 180` but `str.length + Math.ceil(str.length/3) == 178`. I don't think there's an answer because the length of the base64 string depends on what was encoded. – carson Sep 15 '19 at 15:51
  • The string length is 133, but the base64-encoded length is 180. Your formula yields 178. Try your example with the following string: ausd8aud8as7d897ad7a89sdad. It's off by one – carson Sep 15 '19 at 15:59
  • 1
    I think the disparity arises from this part of the question : "not including the = signs sometimes appended to the end of the encoding" ... about which I am prompted to ask, Rounin - why are you ignoring those = signs? they are part of the base64 encoded result ... – landru27 Sep 15 '19 at 16:11
  • 1
    ... in other words, you shouldn't name your function `getLengthOfStringAfterBase64Encoding()` if you are ignoring part of the base64 encoding ... – landru27 Sep 15 '19 at 16:12
  • 2
    In fact, @carson is right, but not because of counting the padding character. The problem is that any function that relies on the `.length` property will give you wrong results for Unicode characters. For example, `getLengthOfStringAfterBase64Encoding('ββ')` returns `3`, while it should be `6`, since `ββ` is encoded as `zrLOsg==`. – Victor Sep 15 '19 at 16:13