We have a Java program whose input is a Base64 string encoding of a file. Files are processed differently depending on their size, so we need a way to determine the file's size from its Base64 input.

Is there a way to do this? I'm thinking of recreating the file from the Base64 string and then getting its size, but that would mean storing it temporarily, which we don't want. What's the best way to do this?

We are using Java 8.

1 Answer

Basically, yes. At its core, Base64 encodes every 3 bytes as 4 characters. However, you must handle 2 additional issues:

  • Base64 is sometimes split up into lines; the spec says whitespace is fine, and must be ignored. The 'new line' is one character (or sometimes two) that therefore must not be counted.
  • What if the file is not an exact multiple of 3? Base64 handles this using a padding algorithm. You always get the Base64 characters in sets of 4, but it is possible that the last set-of-4 encodes only 1 or 2 bytes instead of the usual 3. The = sign is used for padding.
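To make the padding rule concrete, here is a small sketch using Java 8's `java.util.Base64`. The class name `PaddingDemo` and the helper `encodeFirstBytes` are made up for illustration; the sample bytes are simply the ASCII letters `ABCDEF`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class PaddingDemo {
    // Hypothetical helper: encodes the first n bytes of a fixed ASCII sample
    // with the standard (padded, no line breaks) Base64 encoder.
    static String encodeFirstBytes(int n) {
        byte[] sample = "ABCDEF".getBytes(StandardCharsets.US_ASCII);
        return Base64.getEncoder().encodeToString(Arrays.copyOf(sample, n));
    }

    public static void main(String[] args) {
        // 1 byte  -> QQ==      (4 chars, 2 padding)
        // 2 bytes -> QUI=      (4 chars, 1 padding)
        // 3 bytes -> QUJD      (4 chars, no padding)
        // 4 bytes -> QUJDRA==  (8 chars, 2 padding)
        for (int n = 1; n <= 4; n++) {
            String encoded = encodeFirstBytes(n);
            System.out.println(n + " byte(s) -> " + encoded
                    + " (" + encoded.length() + " chars)");
        }
    }
}
```

The output always comes in groups of 4 characters, and the number of `=` signs tells you how many bytes the last group is short of a full 3.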

Both of these issues can be addressed fairly easily:

It's not hard to loop over a string and increment a counter unless the character at that position in the string is whitespace.

Then multiply by 3 and divide by 4.

Then subtract 2 if the string ends in ==. If it ends in =, subtract 1.

You can count = signs during your loop.

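// Returns the decoded size in bytes without decoding the input.
// Assumes well-formed Base64 whose only extra characters are whitespace.
int countBase64Size(String in) {
  int count = 0; // non-whitespace characters, including '=' padding
  int pad = 0;   // number of '=' padding characters
  for (int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    if (c == '=') pad++;
    if (!Character.isWhitespace(c)) count++;
  }
  // Multiply as a long so very long inputs cannot overflow int.
  return (int) (count * 3L / 4) - pad;
}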
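One way to gain confidence in this counting approach is to compare it against a real decoder on inputs that include line breaks. The sketch below does that with Java 8's MIME encoder, which inserts a CRLF after every 76 characters. The class name `Base64SizeCheck` and the helper `matchesDecoder` are hypothetical, and the check assumes well-formed, padded Base64; the decoding is for verification only and would not be part of the production path.

```java
import java.util.Base64;

public class Base64SizeCheck {
    // Same counting approach as the method above, made static for this demo.
    static int countBase64Size(String in) {
        int count = 0;
        int pad = 0;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '=') pad++;
            if (!Character.isWhitespace(c)) count++;
        }
        return (int) (count * 3L / 4) - pad;
    }

    // Hypothetical helper: true if the estimate equals the actual decoded
    // length for an n-byte payload encoded with CRLF line breaks.
    static boolean matchesDecoder(int n) {
        byte[] data = new byte[n];
        String encoded = Base64.getMimeEncoder().encodeToString(data);
        int actual = Base64.getMimeDecoder().decode(encoded).length;
        return countBase64Size(encoded) == actual && actual == n;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 200; n++) {
            if (!matchesDecoder(n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println("estimate matches decoder for 0..200 bytes");
    }
}
```

Payload sizes from 0 to 200 bytes cover all three padding cases as well as inputs that span several 76-character lines.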
rzwitserloot
  • [RFC 2045 #2.10](https://datatracker.ietf.org/doc/html/rfc2045#section-2.10): '"Lines" are defined as sequences of octets separated by a CRLF sequences.' and [#2.1](https://datatracker.ietf.org/doc/html/rfc2045#section-2.1) 'The term CRLF, in this set of documents, refers to the sequence of octets corresponding to the two US-ASCII characters CR (decimal value 13) and LF (decimal value 10) which, taken together, in this order, denote a line break in RFC 822 mail.' – user207421 Jun 18 '21 at 05:33
  • Note that [RFC4648 #3.1 ](https://datatracker.ietf.org/doc/html/rfc4648#section-3.1) is therefore quite wrong to refer to these as 'line feeds'. – user207421 Jun 18 '21 at 05:35
  • @user207421 Fortunately, `Character.isWhitespace()` does the right thing regardless of these esoterics. As a personal tip, falling all over a user of a nebulous term like 'line feed' (i.e. insisting that Line Feed must necessarily mean the LF special character from the ASCII spec) is not a good mindset as a programmer. Who made ASCII king and gave it perennial and irrevocable solo rights to the term 'Line Feed'? Documentation and speech are supposed to convey ideas. If the reader/listener 'gets it', the communication was effective. Regardless of (imagined) misuse of terms. – rzwitserloot Jun 18 '21 at 13:40
  • For what it's worth, if I had written that documentation I would try to be more careful in using terms that are nebulous or, worse, likely interpreted as meaning something other than the idea I'm trying to convey. But, 'wrong'? That's much too harsh. – rzwitserloot Jun 18 '21 at 13:40