45

I have a long file I need to parse. Because it's very long, I need to do it chunk by chunk. I tried this:

function parseFile(file){
    var chunkSize = 2000;
    var fileSize = (file.size - 1);

    var foo = function(e){
        console.log(e.target.result);
    };

    for(var i =0; i < fileSize; i += chunkSize)
    {
        (function( fil, start ) {
            var reader = new FileReader();
            var blob = fil.slice(start, chunkSize + 1);
            reader.onload = foo;
            reader.readAsText(blob);
        })( file, i );
    }
}

After running it I see only the first chunk in the console. If I change `console.log` to a jQuery append to some div, I see only the first chunk in that div. What about the other chunks? How can I make it work?

mnowotka

6 Answers

85

The FileReader API is asynchronous, so you should handle it with callbacks rather than a synchronous loop. A plain for loop won't do the trick, since it doesn't wait for each read to complete before slicing off the next chunk. Here's a working approach.

function parseFile(file, callback) {
    var fileSize   = file.size;
    var chunkSize  = 64 * 1024; // bytes
    var offset     = 0;
    var self       = this; // we need a reference to the current object
    var chunkReaderBlock = null;

    var readEventHandler = function(evt) {
        if (evt.target.error == null) {
            offset += evt.target.result.length;
            callback(evt.target.result); // callback for handling read chunk
        } else {
            console.log("Read error: " + evt.target.error);
            return;
        }
        if (offset >= fileSize) {
            console.log("Done reading file");
            return;
        }

        // off to the next chunk
        chunkReaderBlock(offset, chunkSize, file);
    }

    chunkReaderBlock = function(_offset, length, _file) {
        var r = new FileReader();
        var blob = _file.slice(_offset, length + _offset);
        r.onload = readEventHandler;
        r.readAsText(blob);
    }

    // now let's start the read with the first block
    chunkReaderBlock(offset, chunkSize, file);
}
alediaferia
  • This is brilliant. Reading huge 3GB+ files without issue. The small chunk size makes it a bit slow though. – bryc Feb 15 '15 at 05:54
  • Wrote a CRC32 calculator using this for fun using web workers/dragndrop. http://jsfiddle.net/9xzf8qqj/ – bryc Feb 15 '15 at 06:23
  • Worked for me as well for large files. However, for larger files (>9GB), I found out incrementing `offset` by `evt.target.result.length` was **corrupting** my file! My quick solution was to increment it by `chunkSize` instead. I'm not sure if it's a FS issue (I'm on Ubuntu) or something else, but it works just fine for any filesize if you `offset += chunkSize`. – user40171 May 11 '15 at 05:41
  • I kind of improved it here: https://gist.github.com/alediaferia/cfb3a7503039f9278381 I didn't test it though, so if you notice glitches please let me know. – alediaferia Jun 22 '15 at 10:52
  • I was just thinking... Wouldn't it be better to call next `block()` before invoking callback, so that the async IO is already going on while callback is executing? Because the callback is very likely gonna use some CPU for parsing and it may take a while. – Tomáš Zato Oct 29 '15 at 15:20
  • @TomášZato actually it wouldn't matter much. I'm not a Javascript expert at all, actually I don't really code with Javascript but I think that the actual `async IO` won't start anyway until the `readEventHandler` stack is popped. – alediaferia Mar 15 '16 at 20:18
  • @alediaferia Why do you think so? I guess in that case you would need add some asyncness in the code. In that case it wouldn't be worth it, even though it's just 2 lines. – Tomáš Zato Mar 15 '16 at 20:22
  • FileReader `onload` callback is invoked when the data has been read from the file. This happens "asynchronously" in the sense that it is handled by the JavaScript Event Loop as soon as it is possible to process it. This means that even if I call `chunkReaderBlock` earlier, no IO would occur as long as the stack is still busy with the `readEventHandler` call. For reference: https://developer.mozilla.org/en/docs/Web/JavaScript/EventLoop – alediaferia Mar 15 '16 at 22:43
  • Do you have any recommendation on how I could determine which is the last chunk? I need to make a different REST call for the last chunk – Batman Sep 03 '16 at 18:24
  • @user40171 Thanks for `offset += chunkSize` : my file wasn't corrupted, but I couldn't get the right number of chunks. +1 – Ontokrat May 03 '17 at 19:01
  • according to the [docs](https://developer.mozilla.org/en-US/docs/Web/API/FileReader), ```onload``` is only called if there is no error. Use ```onloadend``` otherwise. I would however recommend using ```onload``` and ```onerror```. In short: the code above is never catching any error. – Flavien Volken Apr 14 '18 at 15:00
  • Hi sir, I'm trying to use your code for uploading big files. I retrieve my file from input with `var file = document.getElementById("file").files` and pass it to parseFile function. I met this error `Uncaught TypeError: _file.slice is not a function`. I realized that file is not a string. Should I read it as a text? How can I do it? Thanks – Andrea Martinelli Aug 03 '18 at 11:09
  • So that chunk that gets passed to the callback, what format is that in? I need to convert it to a byte format so that I can pass it up to the server and process it properly. – Marcel Marino Sep 19 '19 at 15:19
  • `var self = this; // we need a reference to the current object` where exactly is this used? – SOFe Mar 05 '20 at 03:46
  • I managed to get this code to work for me! Thanks! However, on Opera 68 (WebKit engine, MacOS X Catalina) the `evt.target.result.length` is undefined and breaks everything. However, there is `.byteLength` which does what is needed and used it instead. I haven't checked the `offset += chunkSize` solution but I fear it might break in certain edge (no browser pun intended) cases. – Andrei Rînea Jun 12 '20 at 21:46
  • Wow, this answer is so old and still keeps getting attention. I should probably stop and rewrite this messy code. Glad it's useful to some. – alediaferia Jun 12 '20 at 22:57
  • Funny thing is, I tried the same as OP but with`await` but only first chunk was read. Here: https://stackoverflow.com/q/62346764/1796 – Andrei Rînea Jun 14 '20 at 14:51
  • Is creating a `FileReader` for every chunk really necessary? And blindly slicing by bytes might cut a multi-byte character in half and break the encoding. – gre_gor Apr 22 '23 at 17:59
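
Putting the suggestions from the comments above together (advance the offset by the slice's byte size rather than `result.length`, and hook up `onerror` as well), a minimal sketch could look like this. `parseFileByBytes`, `handleChunk` and `done` are hypothetical names, and slicing by bytes can still split a multi-byte character in half (the TextDecoderStream answer further down addresses that):

function parseFileByBytes(file, handleChunk, done) {
    var chunkSize = 64 * 1024; // bytes
    var offset = 0;

    var readNext = function() {
        var blob = file.slice(offset, offset + chunkSize);
        var reader = new FileReader();
        reader.onload = function(evt) {
            handleChunk(evt.target.result);   // chunk contents as text
            offset += blob.size;              // advance by the bytes actually sliced
            if (offset >= file.size) {
                done();
            } else {
                readNext();
            }
        };
        reader.onerror = function() {
            console.log("Read error: " + reader.error);
        };
        reader.readAsText(blob);
    };

    readNext();
}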
12

You can take advantage of Response (part of the Fetch API) to convert most things into anything else (blob, text, json) and also get a ReadableStream that can help you read the blob in chunks.

var dest = new WritableStream({
  write (str) {
    console.log(str)
  }
})

var blob = new Blob(['bloby']);

(blob.stream ? blob.stream() : new Response(blob).body)
  // Decode the binary-encoded response to string
  .pipeThrough(new TextDecoderStream())
  .pipeTo(dest)
  .then(() => {
    console.log('done')
  })
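
As a side note on the "convert most things to anything else" part, a minimal sketch of the same Response trick for text and JSON (the blob contents here are just a made-up example):

var jsonBlob = new Blob(['{"hello":"world"}'], { type: 'application/json' })

new Response(jsonBlob).text().then(str => console.log(str))        // '{"hello":"world"}'
new Response(jsonBlob).json().then(obj => console.log(obj.hello))  // 'world'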

Old answer (WritableStream's pipeTo and pipeThrough were not implemented back then)

I came up with an interesting idea that is probably very fast, since it converts the blob to a ReadableByteStreamReader. It's probably much easier too, since you don't need to handle things like chunk size and offset and then do it all recursively in a loop.

function streamBlob(blob) {
  const reader = new Response(blob).body.getReader()
  const pump = reader => reader.read()
  .then(({ value, done }) => {
    if (done) return
    // uint8array chunk (use TextDecoder to read as text)
    console.log(value)
    return pump(reader)
  })
  return pump(reader)
}

streamBlob(new Blob(['bloby'])).then(() => {
  console.log('done')
})
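
As hinted by the comment in the snippet, the chunks arrive as Uint8Array. A sketch that decodes them to text with TextDecoder ({ stream: true } keeps multi-byte characters intact across chunk boundaries; `streamBlobAsText` is just an assumed name):

function streamBlobAsText(blob) {
  const decoder = new TextDecoder()
  const reader = new Response(blob).body.getReader()
  const pump = () => reader.read().then(({ value, done }) => {
    if (done) {
      console.log(decoder.decode()) // flush anything still buffered
      return
    }
    console.log(decoder.decode(value, { stream: true }))
    return pump()
  })
  return pump()
}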
Endless
  • This is much better than slicing, although you don't get to control the chunk size. (on Chrome, it was 64KiB) – corwin.amber Dec 14 '19 at 16:52
  • try using the new `blob.stream()` and see what chunk size you get, probably better than wrapping blob in a Response and get a stream directly instead – Endless Dec 14 '19 at 21:52
  • @Endless how can we preview a large image file chunk by chunk, so that the DOM doesn't hang? – Developer Jul 15 '20 at 17:21
9

The second argument of slice is actually the end byte (exclusive), not the chunk size. Your code should look something like:

function parseFile(file){
    var chunkSize = 2000;
    var fileSize = (file.size - 1);

    var foo = function(e){
        console.log(e.target.result);
    };

    for(var i =0; i < fileSize; i += chunkSize) {
        (function( fil, start ) {
            var reader = new FileReader();
            var blob = fil.slice(start, chunkSize + start);
            reader.onload = foo;
            reader.readAsText(blob);
        })(file, i);
    }
}

Or you can use the BlobReader library for an easier interface:

BlobReader(blob)
.readText(function (text) {
  console.log('The text in the blob is', text);
});


Minko Gechev
  • Is the loop reliable? I'm rather new to `FileReader` API but I see it is asynchronous. How can we make sure the whole file has been processed completely once the `for loop` ends? – alediaferia Jan 31 '15 at 19:22
  • How can we preview a large image using FileReader? With multiple large image files of around 800 MB, the DOM hangs. – Developer Jul 16 '20 at 19:15
  • [`parseFile` doesn't work with multi-byte characters](https://jsfiddle.net/2ec3dmg8/) – gre_gor Apr 22 '23 at 18:16
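
Regarding the comment above about knowing when the whole file has been processed: one option (a sketch, with `parseFileChunks` as a hypothetical name) is to wrap each read in a Promise and wait for all of them:

function parseFileChunks(file, chunkSize) {
    const reads = [];
    for (let offset = 0; offset < file.size; offset += chunkSize) {
        const blob = file.slice(offset, offset + chunkSize);
        reads.push(new Promise((resolve, reject) => {
            const reader = new FileReader();
            reader.onload = (e) => resolve(e.target.result);
            reader.onerror = () => reject(reader.error);
            reader.readAsText(blob);
        }));
    }
    // Resolves with all chunks, in order, once every read has completed
    return Promise.all(reads);
}

// usage: parseFileChunks(file, 2000).then(chunks => console.log('done', chunks.length));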
7

Here is my FileStreamer class (a TypeScript version is linked in the comments below):

class FileStreamer {
    constructor(file, encoding = 'utf-8') {
        this.file = file;
        this.offset = 0;
        this.defaultChunkSize = 64 * 1024; // bytes
        this.textDecoder = new TextDecoder(encoding);
        this.rewind();
    }
    rewind() {
        this.offset = 0;
    }
    isEndOfFile() {
        return this.offset >= this.getFileSize();
    }
    async eventPromise(target, eventName) {
        return new Promise((resolve) => {
            const handleEvent = (event) => {
                resolve(event);
            };
            target.addEventListener(eventName, handleEvent);
        });
    }
    async readFile(blob) {
        const fileReader = new FileReader();
        fileReader.readAsArrayBuffer(blob);
        const event = await this.eventPromise(fileReader, 'loadend');
        const target = event.target;
        if (target.error) {
            throw target.error;
        }
        return target.result;
    }
    async readBlockAsText(length = this.defaultChunkSize) {
        const blob = this.file.slice(this.offset, this.offset + length);
        const buffer = await this.readFile(blob);
        const decodedText = this.textDecoder.decode(buffer, { stream: true });
        this.offset += blob.size;

        if (this.isEndOfFile()) {
            const finalText = this.textDecoder.decode();
            if (finalText) {
                return decodedText + finalText;
            }
        }
        return decodedText;
    }
    getFileSize() {
        return this.file.size;
    }
}

Example printing a whole file in the console (within an async context)

    const fileStreamer = new FileStreamer(aFile);
    while (!fileStreamer.isEndOfFile()) {
      const data = await fileStreamer.readBlockAsText();
      console.log(data);
    }
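
If you need raw bytes instead of text (as asked in the comments below), a hypothetical readBlockAsArrayBuffer method could reuse the same machinery and simply skip the decoding:

    // Hypothetical addition to the FileStreamer class above: return the raw bytes
    async readBlockAsArrayBuffer(length = this.defaultChunkSize) {
        const blob = this.file.slice(this.offset, this.offset + length);
        const buffer = await this.readFile(blob); // ArrayBuffer
        this.offset += blob.size;
        return buffer;
    }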
Flavien Volken
  • Thanks, very handy. Did you test it? Any corrections? – Leo Apr 30 '18 at 15:09
  • @Leo I am using it in one of my projects and yes it's working fine. Note that all those answer might be deprecated sooner or later by [Streams API](https://developer.mozilla.org/en-US/docs/Web/API/Streams_API). One thing I could improve would be to add the ability to pass an optional encoding parameter to the [fileReader.readAsText function](https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsText) – Flavien Volken May 01 '18 at 14:47
  • Hm, I am going to use it for binary files. Can I just replace `readAsText` with `readAsArrayBuffer`? Or is it safe to use UTF-8 for reading (and output)? – Leo May 01 '18 at 23:40
  • Yes you can use readAsArrayBuffer, or just take my ts version [here](https://gist.github.com/Xample/c1b7664ba33e09335b94379e48a00c8e) – Flavien Volken May 02 '18 at 06:01
  • @Flavienvolken how can we preview a large image file chunk by chunk, so that the DOM doesn't hang? E.g. each image is about 25 MB in size, with about 600 MB of images to preview at a time? – Developer Jul 15 '20 at 17:29
  • If your image is not compressed for instance a bmp or any other file format, then you might create a tile picking only the chunk of data you need. If your image is compressed this is a completely different problem… for instance a codec like the jpeg2000 relies on the entire image data to build a 1/1 ratio (full quality) tile. – Flavien Volken Jul 21 '20 at 14:46
  • [This doesn't work with multi-byte characters](https://jsfiddle.net/kz05dy3u/) – gre_gor Apr 22 '23 at 18:13
  • @gre_gor did you try changing the encoding to utf-16 ? Something like readAsText(blob, 'utf-16') https://stackoverflow.com/a/58965205/532695 – Flavien Volken Apr 23 '23 at 19:21
  • @FlavienVolken Did you? It fails even worse. It causes an infinite loop as offset never reaches the blob size. `result.length` is number of characters, not number of bytes. And UTF-16 isn't fixed-byte width encoding either. `""` in UTF-8 is 6 bytes, 4 in UTF-8 and JS counts it as 2 characters. I would expect text handling code to be able to handle the default encoding anyway. – gre_gor Apr 23 '23 at 19:59
  • @gre_gor Okay, then also try replacing `this.offset += result.length;` with `this.offset += blob.size;` to prevent looping forever, please tell me if it works as expected on your side. – Flavien Volken Apr 27 '23 at 07:18
  • @FlavienVolken [It stops the looping, but still cuts the characters](https://jsfiddle.net/qrxn4o2y/) – gre_gor Apr 27 '23 at 16:04
  • @gre_gor ah ha! we then need to use `textDecoder.decode(buffer, {stream: true})`. Check the updated answer, [it should work now](https://jsfiddle.net/erum74ad/) – Flavien Volken Apr 28 '23 at 06:22
  • @FlavienVolken Yeah, it works now. But I think the default encoding should be UTF-8. – gre_gor Apr 28 '23 at 17:48
5

Parsing a large file into small chunks using a simple method:

// Parse large file into small chunks
var parseFile = function (file) {

    var chunkSize = 1024 * 1024 * 16; // 16MB chunk size
    var fileSize = file.size;
    var currentChunk = 1;
    var totalChunks = Math.ceil(fileSize / chunkSize);

    while (currentChunk <= totalChunks) {

        var offset = (currentChunk - 1) * chunkSize;
        var currentFilePart = file.slice(offset, (offset + chunkSize));

        console.log('Current chunk number is ', currentChunk);
        console.log('Current chunk data', currentFilePart);

        currentChunk++;
    }
};
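
Note that file.slice() only creates Blob references; each currentFilePart still has to be read to get its data, for example with Blob.text() in modern browsers (a minimal sketch):

// e.g. inside the while loop above; Blob.text() returns a Promise<string>
currentFilePart.text().then(function (text) {
    console.log('Current chunk contents', text);
});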
Radadiya Nikunj
-1

Getting chunks by blindly slicing a blob by bytes can cut a multi-byte character in half and break the encoding.

Since handling character boundaries by yourself is a nightmare, you can use TextDecoderStream to help with that.

Here is a solution implemented as an async generator function:

async function* read_chunks(file, chunk_size=1000000, encoding=undefined) {
  let offset = 0;
  const stream = new ReadableStream({
    async pull(controller) {
      let chunk = file.slice(offset, offset + chunk_size);
      chunk = await chunk.arrayBuffer();
      chunk = new Uint8Array(chunk);
      controller.enqueue(chunk);
      if (offset >= file.size) {
        controller.close()
      }
      offset += chunk.length;
    }
  }).pipeThrough(new TextDecoderStream(encoding));
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return;
    yield value;
  }
}

const file = new Blob(["000001002003"]);
(async() => {
  for await (const chunk of read_chunks(file, 4)) {
    console.log(`Chunk: [${chunk.length}] "${chunk}"`);
  }
})();

You can get rid of the custom ReadableStream by replacing it with file.stream(), if you don't care about the size of the chunks.
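
For example, a sketch of the same generator built on file.stream() (browser-chosen chunk sizes; `read_chunks_default` is just an assumed name):

async function* read_chunks_default(file, encoding=undefined) {
  const reader = file.stream().pipeThrough(new TextDecoderStream(encoding)).getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return;
    yield value;
  }
}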

gre_gor