
I have a base64-encoded string that I need to decode into UTF-8 text.

base64_string = "VABpAG0AZQAgAHMAZQByAGUAaQBzAA=="

I am trying to decode base64_string in the following environments:

In browser

method : atob(base64_string)

`Result = "Time series",` 

which is correct. We can verify the same in https://www.base64decode.org

In Node.js, I am decoding with the npm package "atob"

method : atob(base64_string)

Result = "T i m e  s e r i e s".

For some reason, I am getting spaces between the characters, and I don't know why. I have tried trimming the result, but that doesn't help either.

Vega
rachit

1 Answer


TL;DR;

Your string is actually UTF-16, not UTF-8. Here's how to decode it properly.

function atob(b64txt) {
  const buff = Buffer.from(b64txt, 'base64');
  const txt = buff.toString('utf16le');
  return txt;
}

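Explanation: Your base64-encoded string isn't actually UTF-8 or ASCII data. It's UTF-16 (little-endian). That means every character takes at least two bytes: two for most characters, and four for those outside the Basic Multilingual Plane (such as emoji).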

UTF-8 is different: any byte below 128 (0-127) is a complete single-byte character. Every other character is encoded as a sequence of two to four bytes, all of which are 128 or higher; the first byte of the sequence indicates how many bytes follow.
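To make the size difference concrete, here is a small sketch using Node's Buffer.byteLength. The sample characters are my own picks for illustration, not something from the question.

```javascript
// Compare how many bytes the same character occupies in each encoding.
// Escapes are used so the source file's own encoding doesn't matter.
const samples = ['T', '\u00e9', '\u20ac', '\u{1F600}']; // T, e-acute, euro sign, emoji

const sizes = samples.map((ch) => ({
  ch,
  utf8: Buffer.byteLength(ch, 'utf8'),       // 1 to 4 bytes, depending on the character
  utf16le: Buffer.byteLength(ch, 'utf16le'), // 2 bytes, or 4 for a surrogate pair
}));

console.table(sizes);
```

Plain ASCII like T is the case that matters here: one byte in UTF-8, but two bytes (the letter plus a zero) in UTF-16.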

So let's decode your string to character codes and see what it looks like:

const b64txt = 'VABpAG0AZQAgAHMAZQByAGUAaQBzAA==';
const buff = Buffer.from(b64txt, 'base64');
console.log(JSON.stringify(buff));
// >> {"type":"Buffer","data":[84,0,105,0,109,0,101,0,32,0,115,0,101,0,114,0,101,0,105,0,115,0]}

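The first byte (84) is the ASCII code for T. In UTF-8, any byte below 128 is a complete character on its own, yet here every such byte is followed by a 0 byte. Decoded byte by byte (which is what atob does), those zeros become NUL characters between the letters, which is most likely what shows up as the "spaces" in your output. So...not UTF-8 text.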

That's the clue that this string uses two bytes per character, making it UTF-16. And the fact that the 0 follows the character byte is the clue that it's "little-endian": the low-order byte (the part of the value worth 0-255) comes first, and the high-order byte (the part worth multiples of 256) comes second.
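The byte-order reasoning can be checked directly. This is just a sketch using Buffer's readUInt16LE and readUInt16BE on the string from the question; the variable names are mine.

```javascript
const bytes = Buffer.from('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'base64');

// Read the first two bytes (84, 0) as one 16-bit value, both ways.
const asLittleEndian = bytes.readUInt16LE(0); // 0x0054 = 84, the code for 'T'
const asBigEndian = bytes.readUInt16BE(0);    // 0x5400, an unrelated character

console.log(String.fromCharCode(asLittleEndian)); // T
console.log(asBigEndian.toString(16));            // 5400

// Decoding one byte per character keeps the zero bytes as NUL characters.
const bytewise = bytes.toString('latin1');
console.log(JSON.stringify(bytewise.slice(0, 4))); // "T\u0000i\u0000"
```

Only the little-endian reading gives a sensible first character, and the byte-by-byte decode shows the NULs sitting between the letters.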

If you want to change this buffer into text, you need to interpret it as the correct type of string:

const txt = buff.toString('utf16le'); // <- UTF-16, little-endian
console.log(txt);
// >> "Time sereis"

So in Node.js, if you combine those two commands, you end up with a full-fledged solution to get your string decoded properly, as shown above in the TL;DR.

Of course if your encoding type changes, you'd have to change this as well, and do toString('utf8') or whatever the appropriate encoding is.
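If the encoding can vary, one option is to pass it in as a parameter. This is only a sketch: decodeBase64 is a hypothetical helper name, and the second sample string is one I encoded myself from plain UTF-8 "Time series" for comparison.

```javascript
// Hypothetical helper: decode base64 text using a caller-supplied encoding.
function decodeBase64(b64txt, encoding = 'utf8') {
  // Buffer.isEncoding rejects names that Node's Buffer doesn't support.
  if (!Buffer.isEncoding(encoding)) {
    throw new TypeError(`Unsupported encoding: ${encoding}`);
  }
  return Buffer.from(b64txt, 'base64').toString(encoding);
}

// The string from the question holds UTF-16LE bytes.
console.log(decodeBase64('VABpAG0AZQAgAHMAZQByAGUAaQBzAA==', 'utf16le')); // Time sereis

// A string that really is UTF-8 (or ASCII) under the base64 layer.
console.log(decodeBase64('VGltZSBzZXJpZXM=')); // Time series
```

The caller still has to know which encoding the bytes are in; base64 itself carries no information about that.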

(credit: I referenced this and this as I was drafting this answer.)

David784