Converting a nodejs buffer to string and back to buffer gives a different result in some cases

Question

I created a .docx file. Now, I do this:

// read the file to a buffer
const data = await fs.promises.readFile('<pathToMy.docx>')

// Converts the buffer to a string using 'utf8' but we could use any encoding
const stringContent = data.toString()

// Converts the string back to a buffer using the same encoding
const newData = Buffer.from(stringContent)

// We expect the values to be equal...
console.log(data.equals(newData)) // -> false

I don't understand in what step of the process the bytes are being changed...

I already spent sooo much time trying to figure this out, without any result... If someone can help me understand what part I'm missing out, it would be really awesome!

Perhaps it's because a `.docx` file is not a UTF-8 string at all (it's a binary ZIP file) so maybe trying to convert it to a UTF-8 string is lossy in some way as meaningless or invalid UTF-8 sequences are discarded or dealt with in some way that isn't reversible. — jfriend00, Sep 11 '20 at 01:15
Isn't utf8 just taking sequences of 8 bits and converting them into a character? — Sharcoux, Sep 11 '20 at 08:06
No, that's not what `Buffer.toString()` does. See my answer below. — jfriend00, Sep 11 '20 at 16:44

jfriend00 · Accepted Answer · 2020-09-11T21:16:01.897

A .docx file is not a UTF-8 string (it's a binary ZIP file), so when you read it into a Buffer object and then call .toString() on it, you're assuming the buffer already holds UTF-8 encoded text and that you now want to move it into a Javascript string. That's not what you have. Your binary data will almost certainly contain byte sequences that are invalid in UTF-8, and those get replaced with a substitute character, causing an irreversible change.

What Buffer.toString() does is take a Buffer that is ALREADY encoded in UTF-8 and put it into a Javascript string. See this note in the doc:

If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.

So, the code you show in your question wrongly assumes that Buffer.toString() takes binary data and reversibly encodes it as a UTF-8 string. That is not what it does, and that's why it doesn't do what you are expecting.
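As a minimal sketch of that replacement behaviour (the four bytes below are made up for illustration and are not taken from a real .docx file), the round trip through a UTF-8 string turns each invalid byte into U+FFFD, which is then written back as three bytes:

```javascript
// Illustrative bytes only: 0x50 0x4B is "PK" (how ZIP files start),
// while 0xFF and 0xFE can never appear in valid UTF-8.
const original = Buffer.from([0x50, 0x4b, 0xff, 0xfe]);

// Decoding as UTF-8 replaces each invalid byte with U+FFFD.
const text = original.toString('utf8');

// Encoding the string again writes U+FFFD as the bytes EF BF BD.
const roundTripped = Buffer.from(text, 'utf8');

console.log(text.length);                   // 4 (P, K, U+FFFD, U+FFFD)
console.log(roundTripped.length);           // 8 (2 + 3 + 3 bytes)
console.log(original.equals(roundTripped)); // false
```

Once both 0xFF and 0xFE have become the same U+FFFD character, there is no way to tell from the string which byte was originally there.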

Your question doesn't describe what you're actually trying to accomplish. If you want to do something useful with the .docx file, you probably need to parse it from its binary ZIP form into the actual components of the file in their appropriate formats.

Now that you've explained that you're trying to store it in localStorage, you need to encode the binary data into a string format. One popular option is Base64; it isn't super efficient size-wise, but it is better than many others. See "Binary Data in JSON String. Something better than Base64" for prior discussion on this topic. Ignore the notes about compression in that other answer because your data is already ZIP compressed.
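A rough sketch of the Base64 round trip, assuming Node's Buffer API is available (the sample bytes are synthetic, standing in for the file contents read with readFile):

```javascript
// Synthetic stand-in for the file contents: every byte value from 0 to 255.
const fileBytes = Buffer.from(Array.from({ length: 256 }, (_, i) => i));

// Encode to a plain ASCII string that is safe to keep in any string store.
const encoded = fileBytes.toString('base64');

// Decode back to the original bytes.
const decoded = Buffer.from(encoded, 'base64');

console.log(decoded.equals(fileBytes)); // true
console.log(encoded.length);            // 344 characters for 256 bytes
```

In a browser, the resulting string is what you would hand to localStorage.setItem(); the decoding side there would need a browser equivalent of Buffer, which is outside this sketch.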

What I was trying to achieve is to transform it into a string for storing it in localStorage. I just needed a string representation of it without changing its weight. I believe that 'binary' encoding should do what I want, but I need to check. Thanks a lot for your answer, I really thought that UTF8 encoding would accept any bit sequence, just like base64 for instance. — Sharcoux, Sep 11 '20 at 20:46
@Sharcoux - If you want to store binary in a string format, you should probably Base64 encode it. That turns it into an ascii representation that losslessly fits into a Javascript string. The "binary" format in `Buffer.toString()` is not what you think it is. That's a `latin1` encoding which is also not what you have. It is very much misnamed. — jfriend00, Sep 11 '20 at 21:14
base64 is converting 7bits data into 8bits character representation, losing 14% space. But using 'binary' encoding perfectly solved the problem. — Sharcoux, Sep 11 '20 at 22:37
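For readers comparing the two options from these comments, here is a small sketch under the assumption that the data lives in a Node Buffer (the 300 sample bytes are synthetic). In Node, 'binary' is an alias for 'latin1', which maps each byte to one character in the range U+0000 to U+00FF and back, so the round trip is lossless. Base64 packs 6 bits into each character, so it produces 4 characters for every 3 bytes:

```javascript
// Synthetic sample: 300 bytes cycling through every possible value.
const sample = Buffer.from(Array.from({ length: 300 }, (_, i) => i % 256));

// 'latin1' (alias 'binary'): one string character per byte.
const asLatin1 = sample.toString('latin1');
const backFromLatin1 = Buffer.from(asLatin1, 'latin1');

// Base64: four ASCII characters per three bytes.
const asBase64 = sample.toString('base64');

console.log(backFromLatin1.equals(sample)); // true
console.log(asLatin1.length);               // 300 characters
console.log(asBase64.length);               // 400 characters
```

Both are reversible; the difference is that the latin1 string contains control and non-ASCII characters, while the Base64 string is plain ASCII. How much space a particular storage backend charges per character is not something this sketch measures.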
