
I'm in a situation where I need to revert data back to a buffer that has had toString called on it. For example:

const buffer // I need this, or equivalent
const bufferString = buffer.toString() // This is all I have

The Node documentation implies that .toString() defaults to 'utf8' encoding and that I can reverse it with Buffer.from(bufferString, 'utf8'), but this doesn't work and I get different data. (Maybe some data is lost when the buffer is converted to a string, although the documentation doesn't seem to mention this.)

Does anyone know why this is happening or how to fix it?

Here is the data I have to reproduce this:

const intArr = [31, 139, 8, 0, 0, 0, 0, 0, 0, 0, 170, 86, 42, 201, 207, 78, 205, 83, 178, 82, 178, 76, 78, 53, 179, 72, 74, 51, 215, 53, 54, 51, 51, 211, 53, 49, 78, 50, 210, 77, 74, 49, 182, 208, 53, 52, 178, 180, 72, 75, 76, 52, 75, 180, 76, 50, 81, 170, 5, 0, 0, 0, 255, 255, 3, 0, 29, 73, 93, 151, 48, 0, 0, 0]
const buffer = Buffer.from(intArr) // The buffer I want!
const bufferString = buffer.toString() // The string I have! Note that .toString() and .toString('utf8') are equivalent
const differentBuffer = Buffer.from(bufferString, 'utf8') 

You can get the initial intArr from a buffer by doing this:

JSON.parse(JSON.stringify(Buffer.from(buffer)))['data']

Edit: interestingly calling .toString() on differentBuffer gives the same initial string.

Ben Gooding

2 Answers


I think the important part of the documentation you linked is this: "When decoding a Buffer into a string that does not exclusively contain valid UTF-8 data, the Unicode replacement character U+FFFD � will be used to represent those errors." Not all of the bytes in your buffer form valid UTF-8 sequences, as you can see by doing a console.log(bufferString); almost all of it comes out as gibberish. Every invalid sequence is replaced with U+FFFD during decoding, so you irretrievably lose data when converting the buffer into a UTF-8 string, and you can't get that lost data back when converting back into a buffer.

In your example, if you use 'utf16le' instead of 'utf8', no information is lost, and the buffer is the same after converting back, i.e.:

const intArr = [31, 139, 8, 0, 0, 0, 0, 0, 0, 0, 170, 86, 42, 201, 207, 78, 205, 83, 178, 82, 178, 76, 78, 53, 179, 72, 74, 51, 215, 53, 54, 51, 51, 211, 53, 49, 78, 50, 210, 77, 74, 49, 182, 208, 53, 52, 178, 180, 72, 75, 76, 52, 75, 180, 76, 50, 81, 170, 5, 0, 0, 0, 255, 255, 3, 0, 29, 73, 93, 151, 48, 0, 0, 0]
const buffer = Buffer.from(intArr);
const bufferString = buffer.toString('utf16le');
const differentBuffer = Buffer.from(bufferString, 'utf16le');
console.log(buffer); // same as the below log
console.log(differentBuffer); // same as the above log
Wodlo
  • Um, no, that won't work. Just like UTF-8, there are byte sequences that are not valid UTF-16. – Mark Adler Aug 13 '21 at 19:57
  • @MarkAdler It's true that lone surrogates are invalid in UTF-16, but I confirmed that `toString("UTF-16LE")` simply packs it into [a Javascript string, which is an encodingless 16-bit unsigned integer sequence](https://stackoverflow.com/a/76341865/4510033), not actually validating it in UTF-16 (yes, the argument is misleading). The documentation says `"UTF-16LE"` and `"UCS-2"` are synonymous to `Buffer.toString`. – Константин Ван Jun 03 '23 at 00:07
  • In short, `.toString("UTF-16LE")` simply rearranges the octets into a Javascript string, whether or not those octets are lone surrogates, invalid in UTF-16. It's actually more memory-efficient than `toString("binary")`. – Константин Ван Jun 03 '23 at 00:12
  • But do note that it ignores the last octet if the number of octets is an odd number. – Константин Ван Jun 03 '23 at 00:14

Use the 'latin1' or 'binary' encoding with Buffer.toString and Buffer.from. Those two encodings are aliases for the same thing: each byte maps one-to-one to the Unicode characters U+0000 through U+00FF, so no byte value is ever invalid and the conversion is lossless.

Mark Adler