Text to an Array Buffer causes files to be corrupted

Question

I have a sample in which the user selects a file (PDF files in particular), the file is read into an array buffer, a file is constructed back from that array buffer, and the result is downloaded. This works as expected.

<input type="file" id="file_input" class="foo" />
<div id="output_field" class="foo"></div>


$(document).ready(function(){
    $('#file_input').on('change', function(e){
        readFile(this.files[0], function(e) {
            //manipulate with result...
            $('#output_field').text(e.target.result);
            try {
                var file = new Blob([e.target.result], { type: 'application/pdf' });
                var fileURL = window.URL.createObjectURL(file);
                var seconds = new Date().getTime() / 1000;
                var fileName = "cert" + parseInt(seconds) + ".pdf";
                var a = document.createElement("a");
                document.body.appendChild(a);
                a.style = "display: none";
                a.href = fileURL;
                a.download = fileName;
                a.click();
            }
            catch (err){
                $('#output_field').text(err);
            }
        });
    });
});

function readFile(file, callback){
    var reader = new FileReader();
    reader.onload = callback
    reader.readAsArrayBuffer(file);
}

Now let's say I used reader.readAsText(file); instead of reader.readAsArrayBuffer(file);. In that case I would convert the text to an array buffer and try to do the same thing.

$(document).ready(function(){
    $('#file_input').on('change', function(e){
        readFile(this.files[0], function(e) {
            //manipulate with result...
            try {
                var buf = new ArrayBuffer(e.target.result.length * 2);
                var bufView = new Uint16Array(buf);
                for (var i=0, strLen = e.target.result.length; i<strLen; i++) {
                    bufView[i] = e.target.result.charCodeAt(i);
                }

                var file = new Blob([bufView], { type: 'application/pdf' });
                var fileURL = window.URL.createObjectURL(file);
                var seconds = new Date().getTime() / 1000;
                var fileName = "cert" + parseInt(seconds) + ".pdf";
                var a = document.createElement("a");
                document.body.appendChild(a);
                a.style = "display: none";
                a.href = fileURL;
                a.download = fileName;
                a.click();
            }
            catch (err){
                $('#output_field').text(err);
            }
        });
    });
});

function readFile(file, callback){
    var reader = new FileReader();
    reader.onload = callback
    reader.readAsText(file);
}

Now if I pass a PDF file that is small and only contains text, this works fine, but when I select files that are large and/or have images in them, a corrupted file is downloaded.

I do know that I'm making life harder for myself. But what I'm trying to do is somehow convert the result from readAsText() into an ArrayBuffer so that readAsText() and readAsArrayBuffer() work identically.

Is there any reason for you to actually use `readAsText`? As I understand it, readAsArrayBuffer is working fine, no? It might very well be that readAsText loses some information in bytes that can't be interpreted as text – JensV, Apr 15 '19 at 14:49
It's because in my particular case I'm dealing with a backend server that returns files in this format, and I can't change anything about it. This example was just to demonstrate my case and hopefully find a solution to it – user3159792, Apr 16 '19 at 17:47
Can you post the file that you are trying with and the result you got, or at least the difference (using a hex editor)? – Bergi, Apr 22 '19 at 13:29
Why are you using a `FileReader` at all? Just `window.URL.createObjectURL(this.files[0])` should work… – Bergi, Apr 22 '19 at 13:33
If your backend server only supports text, you will have to encode your array of bytes in a textual way, e.g. base64: read the file as an ArrayBuffer, convert the ArrayBuffer to base64, and send that to your server. See https://stackoverflow.com/questions/9267899/arraybuffer-to-base64-encoded-string – Barkermn01, Apr 28 '19 at 19:36

score 3 · Answer 1 · answered Apr 22 '19 at 14:09


The readAsText method doesn't simply make the bytes accessible in a UTF-16 string. Instead, it decodes them as text according to a given text encoding, UTF-8 by default. This will mangle any binary data that you are trying to read. As you already figured out, use readAsArrayBuffer for that.

You can try to use a TextEncoder to encode your text back to a typed array, but that's not guaranteed to yield the same result: a BOM gets stripped, invalid UTF-8 sequences are replaced with the replacement character U+FFFD, and if you're unlucky then even Unicode normalisation will happen.
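
A minimal sketch of that round trip, for any environment that provides the standard TextDecoder and TextEncoder globals (modern browsers, recent Node.js). The helper name roundTripUtf8 and the sample bytes are made up for illustration, and the decoder is assumed to behave like readAsText with its default UTF-8 encoding.

```javascript
// Decode bytes as UTF-8 (assumed to match readAsText's default), then
// encode the resulting string back to bytes.
function roundTripUtf8(bytes) {
    var text = new TextDecoder('utf-8').decode(bytes);
    return new TextEncoder().encode(text);
}

// Hypothetical sample: "%PDF" followed by four arbitrary non-ASCII bytes.
var original = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3]);
var result = roundTripUtf8(original);

console.log(original.length); // 8
console.log(result.length);   // 16: each of the four invalid bytes became EF BF BD

// A leading UTF-8 BOM (EF BB BF) is dropped by the decoder.
var withBom = roundTripUtf8(new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]));
console.log(withBom.length);  // 1
```

The ASCII part survives, which is consistent with small text-only files appearing to work while files with binary streams break.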

It might get easier if you explicitly specify a single-byte decoding, but really you should just use readAsArrayBuffer.
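
As a rough illustration of why a single-byte decoding helps: if every byte maps to exactly one character, the mapping can be inverted. The sketch below assumes the x-user-defined encoding from the WHATWG Encoding Standard, which maps bytes 0x00-0x7F to the same code points and bytes 0x80-0xFF to U+F780-U+F7FF. Both helper names are hypothetical, and decodeXUserDefined only simulates what a call such as reader.readAsText(file, 'x-user-defined') is assumed to produce. Whether the text coming from a particular backend has this form is something you would need to verify.

```javascript
// Simulates x-user-defined decoding: one byte in, one UTF-16 code unit out.
function decodeXUserDefined(bytes) {
    var out = '';
    for (var i = 0; i < bytes.length; i++) {
        var b = bytes[i];
        out += String.fromCharCode(b < 0x80 ? b : 0xF780 + (b - 0x80));
    }
    return out;
}

// Inverts the mapping: the low 8 bits of each code unit are the original byte.
function singleByteTextToBytes(text) {
    var bytes = new Uint8Array(text.length);
    for (var i = 0; i < text.length; i++) {
        bytes[i] = text.charCodeAt(i) & 0xFF;
    }
    return bytes;
}

// Every possible byte value survives the round trip.
var allBytes = new Uint8Array(256);
for (var i = 0; i < 256; i++) allBytes[i] = i;
var restored = singleByteTextToBytes(decodeXUserDefined(allBytes));
console.log(restored.length); // 256
```

Note that this writes one byte per character into a Uint8Array. Storing char codes in a Uint16Array, as in the question, writes two bytes per character and so cannot reproduce the original file.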

answered Apr 22 '19 at 14:09

Bergi



`readAsText` of binary data (like images) will corrupt the file. When decoding from UTF-8 there are byte sequences that can be altered. Only values between 0x00 and 0x7F are copied verbatim. Values from 0xC2 to 0xDF indicate a two-byte sequence, and values from 0xF0 to 0xF4 a four-byte sequence. If the following bytes in the sequence aren't between 0x80 and 0xBF, the sequence is illegal and may be removed or altered. For example [0xF1,0x80,0x80,0x80] is decoded to [0x40000] and [0xF0,0x80,0x80,0x80] to [0xFFFD,0xFFFD,0xFFFD,0xFFFD] (where 0xFFFD is the Replacement Character) – some Apr 29 '19 at 06:21
@some Thanks for the example! – Bergi Apr 29 '19 at 08:52
I forgot to convert 0x40000 to surrogate pairs. [0xF1,0x80,0x80,0x80] is decoded as [0xD8C0,0xDC00]. I added an answer with more examples. – some Apr 29 '19 at 12:08

some · Answer 2 · 2019-04-29T22:38:58.963

As Bergi has already answered, you should use readAsArrayBuffer for binary data instead of readAsText, since the latter decodes the byte sequences, by default as UTF-8.

UTF-8 is a variable-length encoding, where a character takes between 1 and 4 bytes. Running the decoder on binary data that isn't UTF-8 will irrecoverably corrupt the binary data.

For example, only 0x00-0x7F is copied verbatim. 0xC2 to 0xDF starts a 2-byte sequence, 0xE0 to 0xEF a 3-byte sequence, and 0xF0 to 0xF4 a 4-byte sequence. 0x80 to 0xBF is a continuation byte inside a sequence. The remaining values (0xC0, 0xC1, and 0xF5 to 0xFF) never appear in valid UTF-8.
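
The byte ranges can be written down as a small classifier. This is only a sketch of the first-byte rules from the UTF-8 definition (RFC 3629), under which lead bytes above 0xF4 are not valid. The function name utf8SequenceLength and its return convention are invented for this example, and it does not check the extra restrictions on the second byte after 0xE0, 0xED, 0xF0, or 0xF4.

```javascript
// Returns the length of the UTF-8 sequence that byte b would start,
// 0 if b is a continuation byte, or -1 if b never occurs in valid UTF-8.
function utf8SequenceLength(b) {
    if (b <= 0x7F) return 1;              // ASCII, copied verbatim
    if (b <= 0xBF) return 0;              // 0x80-0xBF: continuation byte
    if (b <= 0xC1) return -1;             // 0xC0, 0xC1: overlong, never valid
    if (b <= 0xDF) return 2;              // 0xC2-0xDF
    if (b <= 0xEF) return 3;              // 0xE0-0xEF
    if (b <= 0xF4) return 4;              // 0xF0-0xF4
    return -1;                            // 0xF5-0xFF: never valid
}

console.log(utf8SequenceLength(0x41)); // 1
console.log(utf8SequenceLength(0x80)); // 0
console.log(utf8SequenceLength(0xE1)); // 3
console.log(utf8SequenceLength(0xFF)); // -1
```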

Here are a couple of examples of how it gets corrupted (node 12.1):

      ORIGINAL        =>  DECODED from UTF-8 to UTF-16 => ENCODED from UTF-16 to UTF-8
----------------------------------------------------------------------------------------------------------------------
[0xC2,0x80,0x80,0x80] => [0x0080,0xFFFD,0xFFFD]        => [0xC2,0x80,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xC3,0x80,0x80,0x80] => [0x00C0,0xFFFD,0xFFFD]        => [0xC3,0x80,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xE0,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xE1,0x80,0x80,0x80] => [0x1000,0xFFFD]               => [0xE1,0x80,0x80,0xEF,0xBF,0xBD]
[0xF0,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0xF1,0x80,0x80,0x80] => [0xD8C0,0xDC00]               => [0xF1,0x80,0x80,0x80]
[0xF0,0x80,0x00,0x00] => [0xFFFD,0xFFFD,0x0000,0x0000] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0x00,0x00]
[0x80,0x80,0x80,0x80] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]
[0x81,0x82,0x83,0x84] => [0xFFFD,0xFFFD,0xFFFD,0xFFFD] => [0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD,0xEF,0xBF,0xBD]

0xFFFD is the Replacement Character, which is used when the input can't be decoded to a valid code point.
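
Rows like the ones in the table can be reproduced with a few lines. This is a sketch rather than the exact script used for the table; the helper name showRoundTrip is hypothetical, and it assumes the standard TextDecoder and TextEncoder globals are available.

```javascript
// Decodes the bytes as UTF-8, lists the resulting UTF-16 code units,
// and re-encodes the string to UTF-8.
function showRoundTrip(bytes) {
    var text = new TextDecoder('utf-8').decode(new Uint8Array(bytes));
    var codeUnits = [];
    for (var i = 0; i < text.length; i++) {
        codeUnits.push(text.charCodeAt(i));
    }
    var encoded = Array.from(new TextEncoder().encode(text));
    return { codeUnits: codeUnits, encoded: encoded };
}

// Formats a list of numbers as hex, padded to the given number of digits.
function hexList(values, width) {
    return '[' + values.map(function (v) {
        return '0x' + v.toString(16).toUpperCase().padStart(width, '0');
    }).join(',') + ']';
}

var row = showRoundTrip([0xE1, 0x80, 0x80, 0x80]);
console.log(hexList(row.codeUnits, 4)); // [0x1000,0xFFFD]
console.log(hexList(row.encoded, 2));   // [0xE1,0x80,0x80,0xEF,0xBF,0xBD]
```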

score -1 · Answer 3 · answered Apr 22 '19 at 13:17

It could be what I ran into long ago working with graphics files. Binary files are in a specific format for a reason, and bytes such as CR/LF might be legitimate data in their own place. Reading a binary file as text and writing it back out could throw in extra CR/LF per line, throwing off the original format, content, and pointers in the file.

To confirm this, I would take your original file, read/write as array buffer to one Test file, then do the same thing with read/write as text to a SecondTest file. Then do a binary compare between the two files.
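
If no binary diff tool is at hand, the comparison can also be done in script. A minimal sketch, assuming both files have already been loaded as Uint8Array values (for example from readAsArrayBuffer results); the function name firstDifference is made up for this example.

```javascript
// Returns the offset of the first differing byte, or -1 if both
// arrays are identical. A length mismatch counts as a difference
// at the end of the shorter array.
function firstDifference(a, b) {
    var n = Math.min(a.length, b.length);
    for (var i = 0; i < n; i++) {
        if (a[i] !== b[i]) return i;
    }
    return a.length === b.length ? -1 : n;
}

var test1 = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xE2]);
var test2 = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0xEF, 0xBF, 0xBD]);
console.log(firstDifference(test1, test2)); // 4
console.log(firstDifference(test1, test1)); // -1
```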

I would bet you are getting extra stuff in there unintentionally.
