Quick note:

I am open to checking whether the string I pass into Buffer.from is Base64-formatted. From what I understand, the best way to check whether a string is Base64 is a regex, even though that is not perfect. So I thought about checking the result of the Base64 decode instead of the input to it.

The code:

let buffer = Buffer.from('hey there', 'base64');
let bufferResult = buffer.toString('utf-8');
console.log(bufferResult) // Output: �쭅��

What I am trying to do:

I want to check for �쭅�� and similar output from buffer.toString() to safeguard my application against bad output. I have written simple regexes along the lines of /^[a-zA-Z]+$/ to do this, but I don't think that is robust (mainly because I don't know what buffer.toString() can output).

Am I barking up the wrong tree, and should I be checking the input to Buffer.from instead, or is there a correct way to achieve what I am trying to do?

Chasen Bettinger
  • Even some words are valid base64 strings. Probably a good idea is to use a simple alphanumeric regex that matches a string of X or more chars. Something like `/^[A-Z0-9]{8,}=?$/i`. There will be false positives, but not too many. – Wiktor Stribiżew Nov 19 '18 at 23:35
  • Why check input? Check the output. There's already some who [figured it out](https://stackoverflow.com/questions/8571501/how-to-check-whether-a-string-is-base64-encoded-or-not). – JM-AGMS Nov 20 '18 at 22:14
  • @JM-AGMS, you can detect if the input is incorrect (not all base64 strings decode into a valid binary octet string) and discard the decoding completely based on that. Checking the output can be impossible if any binary string is allowable. – Luis Colorado Nov 22 '18 at 08:35
  • @WiktorStribiżew, why do you restrict your regex with `{8,}`? What about `AA==` (a single `0x00` byte)? Or `""` (the empty binary string)? And what about strings whose length is not a multiple of four? (Those are not valid base64 encodings, but you don't say a word about them; a simple nine-char `"ABCDEFGHI"` is not a valid base64 encoding.) – Luis Colorado Nov 22 '18 at 08:40
  • @LuisColorado It is an example, one may decrease the lower limit. As I say, there will always be false positives here as regular words can be valid base64 strings, and this lower limit is a kind of a trade-off here. Replace with `{2,}` and get many false positives. Increase to `{16,}` and small base64 strings won't be returned. Suit yourself. There is no 100% sure answer here. – Wiktor Stribiżew Nov 22 '18 at 08:44
  • @WiktorStribiżew, nope, there are no false positives: the language of base64 strings is perfectly well defined, with no room for false matches. `ABCDEFGHI` can never be parsed as a base64 string, nor can a simple `A`, but any alphanumeric word of 4 chars can. – Luis Colorado Nov 22 '18 at 09:15

1 Answer

There's one problem in your question: there are several Base64 encodings, which differ in the extra non-alphanumeric characters they use.

Base64 encodings use all uppercase ASCII letters, all lowercase ASCII letters and the digits (26 + 26 + 10 = 62 chars) plus two more characters, which can be (depending on what you are using Base64 for) {'+', '/'}, {'.', '-'}, {'.', '_'} or some other pair (see here for a thorough explanation).

Another issue is that long Base64 strings normally have their line length restricted to 76 chars, so they contain interspersed newlines (with or without the \r of the CRLF pair) up to the final line, which can end in one or two '=' chars.
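As a rough illustration of the line-wrapping point, here is a minimal Node.js sketch. The helper name `wrapBase64` is hypothetical (it is not a Node API), and the 76-char width is simply the limit mentioned above. It wraps an encoded string into CRLF-separated lines and then decodes the wrapped text directly, relying on Node's documented behaviour of ignoring whitespace inside Base64 input.

```javascript
// Hypothetical helper: split an encoded string into CRLF-separated
// lines of at most `width` characters (76 is the limit discussed above).
function wrapBase64(encoded, width = 76) {
  const lines = encoded.match(new RegExp(`.{1,${width}}`, 'g')) || [];
  return lines.join('\r\n');
}

const payload = Buffer.alloc(100, 0x41);      // 100 bytes, all 0x41
const encoded = payload.toString('base64');   // one long line
const wrapped = wrapBase64(encoded);          // several shorter lines

// Removing all whitespace recovers the single-line form.
const unwrapped = wrapped.replace(/\s+/g, '');

// Node skips whitespace while decoding, so the wrapped text decodes as-is.
const decoded = Buffer.from(wrapped, 'base64');

console.log(wrapped.split('\r\n').map((line) => line.length));
console.log(unwrapped === encoded, decoded.equals(payload));
```

So a validator has to decide up front whether embedded newlines are acceptable, or strip them before checking anything else.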

Also, some (not all) Base64 strings end with one or two '=' chars, which pad the encoded length to a multiple of four: two '=' when the input length in bytes is 1 (mod 3), one '=' when it is 2 (mod 3), and none when it is a multiple of 3. This padding is not optional in the standard encoding, but some variants (e.g. for URLs) omit the final equals signs.

If you intend to parse the '+' and '/' variant (as used in MIME encoding), then a valid (and strict) regex for Base64 can be:
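The padding rule can be sketched in a few lines of Node.js. Both helper names below are hypothetical, and the '-' / '_' alphabet handled by `toStandardBase64` is an assumption taken from the URL-safe variant in RFC 4648, which is not one of the pairs listed above. Treat this as a sketch of the arithmetic, not a complete normaliser.

```javascript
// Hypothetical helper: number of '=' characters that the standard
// encoding appends for an input of `byteLength` bytes.
function paddingFor(byteLength) {
  return (3 - (byteLength % 3)) % 3;
}

// Hypothetical helper: turn an unpadded URL-safe string (assumed to use
// '-' and '_', per RFC 4648) into the padded '+' / '/' form.
function toStandardBase64(urlSafe) {
  const swapped = urlSafe.replace(/-/g, '+').replace(/_/g, '/');
  const remainder = swapped.length % 4;
  if (remainder === 1) {
    // A single leftover character carries only 6 bits, less than one byte.
    throw new RangeError('length cannot be 1 (mod 4)');
  }
  return swapped + '='.repeat((4 - remainder) % 4);
}

// Zero-filled buffers of 1, 2 and 3 bytes show the three padding cases.
for (const n of [1, 2, 3]) {
  console.log(n, paddingFor(n), Buffer.alloc(n).toString('base64'));
}
console.log(toStandardBase64('-_8'));
```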

(((\r?\n|\s)*[A-Za-z0-9+\/]){4})*(((\r?\n|\s)*[A-Za-z0-9+\/]){2}((\r?\n|\s)*=){2}|((\r?\n|\s)*[A-Za-z0-9+\/]){3}((\r?\n|\s)*=){1})?

but think twice before using it: it matches the longest Base64 string possible (because it cannot analyse the context around the match) and ignores any extra chars after it, so for an invalid Base64 string like:

ABCDE

(5 characters, while a Base64 string has to be a multiple of four characters long, including the final '='s) it will match the first four, "ABCD", as the longest valid Base64 string possible. For the whole string to be valid, it should have been encoded as ABCDEA== (assuming the missing two bits of the last byte are zeros). See the demo above for a sample of this. The empty string is also matched (it is a valid zero-length Base64 string).
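The behaviour just described can be reproduced with a minimal Node.js sketch. The regex is copied from above; the variable names are only for illustration.

```javascript
// The unanchored regex from the answer, copied verbatim.
const loose = /(((\r?\n|\s)*[A-Za-z0-9+\/]){4})*(((\r?\n|\s)*[A-Za-z0-9+\/]){2}((\r?\n|\s)*=){2}|((\r?\n|\s)*[A-Za-z0-9+\/]){3}((\r?\n|\s)*=){1})?/;

// Only the first four characters match; the trailing 'E' is ignored.
const partial = 'ABCDE'.match(loose)[0];

// With two more characters and padding, the whole string matches.
const whole = 'ABCDEA=='.match(loose)[0];

// The empty string matches too (as a zero-length Base64 string).
const empty = ''.match(loose)[0];

console.log(JSON.stringify([partial, whole, empty]));
```

Since the match succeeds on a prefix, `loose.test(input)` on its own says very little about whether the whole input is Base64.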

NOTE

A good Base64 decoder not only parses the string the same way the regex matcher does, but also produces the binary string it represents, with very little extra effort. So in this case I recommend not using a regex matcher except as an exercise, or perhaps as a JavaScript validator in the client browser that checks the format before sending Base64-encoded strings to a server (which will need to decode them again anyway).
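In Node.js specifically, `Buffer.from(input, 'base64')` does not throw on malformed input, so "let the decoder validate" needs an extra step. The sketch below shows one common approach, not the only one: re-encode the decoded bytes and compare with the input. Both function names are hypothetical. The round trip only accepts canonical, padded, standard-alphabet input without whitespace. The second helper covers the "check the output" idea from the question by decoding UTF-8 in fatal mode instead of pattern-matching replacement characters.

```javascript
// Hypothetical helper: returns the decoded bytes, or null when `input`
// does not survive a decode/re-encode round trip unchanged.
function decodeBase64Strict(input) {
  const bytes = Buffer.from(input, 'base64');
  return bytes.toString('base64') === input ? bytes : null;
}

// Hypothetical helper: returns the text, or null when `bytes` is not
// valid UTF-8 (a fatal TextDecoder throws instead of emitting U+FFFD).
function decodeUtf8Strict(bytes) {
  try {
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}

// The string from the question is rejected by both checks.
console.log(decodeBase64Strict('hey there'));
console.log(decodeUtf8Strict(Buffer.from('hey there', 'base64')));

// A genuine encoding of "hey there" passes both.
const bytes = decodeBase64Strict('aGV5IHRoZXJl');
console.log(bytes && decodeUtf8Strict(bytes));
```

Whether to reject at the Base64 stage, the UTF-8 stage, or both depends on whether arbitrary binary payloads are legitimate for the application.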

NOTE 2

The following is a good test for Base64 strings: it allows only whitespace between the beginning of the line and the Base64-encoded string, and between the end of the encoded string and the end of the line (forcing the Base64 encoding onto its own lines). This makes it a stronger test:

^(((\r?\n|\s)*[A-Za-z0-9+\/]){4})*(((\r?\n|\s)*[A-Za-z0-9+\/]){2}(=(\r?\n|\s)*){2}|((\r?\n|\s)*[A-Za-z0-9+\/]){3}(=(\r?\n|\s)*))?$

See demonstration here
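For a quick local check of the anchored pattern, here is a minimal Node.js sketch. The regex is copied from above and the sample inputs are only illustrative.

```javascript
// The anchored regex from NOTE 2, copied verbatim.
const strict = /^(((\r?\n|\s)*[A-Za-z0-9+\/]){4})*(((\r?\n|\s)*[A-Za-z0-9+\/]){2}(=(\r?\n|\s)*){2}|((\r?\n|\s)*[A-Za-z0-9+\/]){3}(=(\r?\n|\s)*))?$/;

const samples = ['ABCDE', 'ABCDEA==', 'aGV5\r\nIHRo\r\nZXJl', '', 'hey there'];

// Map each sample to whether the whole string is well-formed Base64.
const results = samples.map((s) => [s, strict.test(s)]);
console.log(results);
```

Note that 'hey there' is accepted: the space is skipped as whitespace and the remaining eight letters are all in the Base64 alphabet. The regex can only say that the input is well formed, not that it was meant to be Base64, which is the limitation discussed in the comments on the question.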

Luis Colorado