
I am trying to access the first few lines of text files using the File API in JavaScript.

In order to do so, I slice an arbitrary number of bytes from the beginning of the file and hand the blob over to the FileReader.

For large files this takes a very long time, even though my understanding is that only the first few bytes of the file need to be accessed.

  • Is there something in the underlying implementation that requires the whole file to be accessed before it can be sliced?
  • Does it depend on the browser's implementation of the File API?

So far I have tested in both Chrome and Edge (Chromium).

Analysis in Chrome using the performance dev tools shows a lot of idle time before reader.onloadend fires and no increase in RAM usage. However, this might be because the File API is implemented in the browser itself and does not show up in the JavaScript performance statistics.

My implementation of the FileReader looks something like this:

const reader = new FileReader();

reader.onloadend = (evt) => {
  if (evt.target.readyState == FileReader.DONE) {
    console.log(evt.target.result.toString());
  }
};

// Slice first 10240 bytes of the file
const blob = files.item(0).slice(0, 1024 * 10);

// Start reading the sliced blob
reader.readAsBinaryString(blob);

This works fine, but as described it performs quite poorly for large files. I tried it with 10 KB, 100 MB and 6 GB files. The time until the first 10 KB are logged seems to correlate directly with the file size.
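To make the measurement concrete, the effect can be timed with a plain `performance.now()` timer around the read (a minimal sketch, assuming `files` comes from an `<input type="file">` as above):

const file = files.item(0);
const t0 = performance.now();

const reader = new FileReader();
reader.onloadend = () => {
  // Logs how long it took until the first 10240 bytes were available
  console.log(`first 10 KB after ${Math.round(performance.now() - t0)} ms`);
};
reader.readAsBinaryString(file.slice(0, 1024 * 10));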

Any suggestions on how to improve performance for reading the beginning of a file?


Edit: Using Response and DOM streams as suggested by @BenjaminGruenbaum sadly does not improve the read performance.

const dest = new WritableStream({
    write(str) {
        console.log(str);
    },
});

const blob = files.item(0).slice(0, 1024 * 10);

(blob.stream ? blob.stream() : new Response(blob).body)
  // Decode the binary stream to text
  .pipeThrough(new TextDecoderStream())
  .pipeTo(dest)
  .then(() => {
      console.log('done');
  });

kacase
  • Hey, does using a [`Response` and DOM streams](https://stackoverflow.com/a/37599682/1348195) help? I am not sure why `readAsBinaryString` is slow here, since using `.slice` on the blob is supposed to only read the part you want; however, what you are describing indicates that it is indeed waiting for the whole file. – Benjamin Gruenbaum Feb 17 '21 at 15:29
  • @BenjaminGruenbaum reading the file using Response and DOM streams works, but sadly does not improve the read performance for large files. – kacase Feb 17 '21 at 15:47
  • @BenjaminGruenbaum I added the DOM Stream implementation to the question. – kacase Feb 17 '21 at 15:53
  • [Can't repro](https://jsfiddle.net/3h1urgom/1/) here on macOS with an SSD drive. Could you show exactly what you measure and how? Where are your files stored? What happens when you use data from memory (`new Blob([await file.arrayBuffer()])`)? Browsers have to take a "snapshot" of the File when first accessed, but I think that generally only the lastModified field is used for this, though your OS may also take more time to access the file's metadata for bigger files. – Kaiido Feb 23 '21 at 09:28
  • Hi @Kaiido, we measured it using the "performance" tab in Chrome and analyzed the snapshot. We were able to reproduce the same problem in your fiddle. However, the timer you set is not affected. It seems like the `onchange` event is only called after some file operation has occurred, and the duration of that operation increases with file size. The time between adding the file and the `onchange` event being fired is affected by file size. – kacase Feb 24 '21 at 12:35
  • So the FileReader has nothing to do with it? Why not make that clear in the question? To me this really just sounds like your OS takes all this time to touch the file and produce the metadata; nothing slice() can change, I'm afraid. As to why the time your OS takes scales with the file size, I have no clue. Might be worth testing in other environments, with another hard drive, another file system, etc. – Kaiido Feb 24 '21 at 14:00
  • @Kaiido apparently so. I will update the Question. It seems to be an issue with the `input` of type File. I will do some further testing and update accordingly. – kacase Feb 24 '21 at 14:03
  • I think the time is mainly spent before the 'read' step. Consider focusing on how the files are acquired in the first place. – 小聪聪到此一游 Feb 26 '21 at 03:21
  • And consider reading the beginning of the file while it is being loaded, rather than afterwards. – 小聪聪到此一游 Feb 26 '21 at 03:27
  • On what platform do you experience this issue and with which type of file? – Dan Macak Feb 27 '21 at 12:51
  • It seems we will have a request for change coming from this question (just watching how it goes; I could reproduce it and couldn't find a way to work around it). – Bob Feb 28 '21 at 21:31
  • Maybe useful; I definitely learned something here: https://stackoverflow.com/questions/14438187/javascript-filereader-parsing-long-file-in-chunks?noredirect=1&lq=1#answer-28318964 I take it you've seen this? – Todd Mar 02 '21 at 01:22
  • So your main issue is the input taking longer to process the file before the subsequent FileReader code executes. However, for knowledge's sake: if you need to process large amounts of data or run expensive operations, you should consider using a web worker (https://www.html5rocks.com/en/tutorials/file/filesystem-sync/). – Pierre Burton Mar 02 '21 at 08:21
  • I suggest you do that in chunks of 1024 bytes in a for loop of 1-10 iterations; I think that can change the performance. Of course, you should change the start and end positions in the slice method for each iteration (see the sketch after these comments). – Itay wazana Mar 02 '21 at 08:44
  • I think you must implement chunked mode on the server side first. – tdjprog May 04 '23 at 18:05
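
For reference, a minimal sketch of the chunked variant suggested in the comments (chunk size and count are illustrative, `files` is the same FileList as in the question, and whether this is faster still depends on the same underlying file access):

// Read the first `chunks` slices of `chunkSize` bytes sequentially
async function readInChunks(file, chunkSize = 1024, chunks = 10) {
  const parts = [];
  for (let i = 0; i < chunks; i++) {
    const blob = file.slice(i * chunkSize, (i + 1) * chunkSize);
    // Blob.text() resolves with the slice decoded as UTF-8
    parts.push(await blob.text());
  }
  return parts.join('');
}

readInChunks(files.item(0)).then((text) => console.log(text));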

2 Answers


Just for kicks, here it is with a worker thread and the File System Access API.

No idea if either of those things helps; I have no 6 GB files to test with. It does at least get the reading off the main thread, which helps performance in some sense.

// Build the worker from an inline Blob so the snippet is self-contained
const workerSource = `
    self.onmessage = async (e) => {
        const file = await e.data.getFile();
        // FileReaderSync is only available inside workers
        postMessage(new FileReaderSync().readAsText(file.slice(0, 1024 * 10)));
    };
`;

const worker = new Worker(
    URL.createObjectURL(new Blob([workerSource], { type: 'application/javascript' }))
);
worker.onmessage = (e) => void console.log(e.data);

// Let the user pick a file and hand its handle to the worker
const [handle] = await window.showOpenFilePicker({ startIn: 'documents' });
worker.postMessage(handle);

Edit:

I forgot to mention: you need a Chromium-based browser for this right now, sorry (tested on Edge). It also won't run in a JSFiddle because of web-worker security restrictions, but you can copy-paste it into the DevTools console on google.com, where for some reason the headers don't prevent it from running. If this actually helps, please put the worker in its own file, as sketched below.
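
A minimal sketch of the same logic with the worker in its own file (the file name is illustrative):

// worker.js (illustrative file name)
self.onmessage = async (e) => {
  const file = await e.data.getFile();
  // FileReaderSync is only available inside workers
  postMessage(new FileReaderSync().readAsText(file.slice(0, 1024 * 10)));
};

// main thread
const worker = new Worker('worker.js');
worker.onmessage = (e) => console.log(e.data);

const [handle] = await window.showOpenFilePicker({ startIn: 'documents' });
worker.postMessage(handle);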

Bobby Morelli

How about this?

// Resolves with an ArrayBuffer containing the first n bytes of the file
function readFirstBytes(file, n) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => {
      resolve(reader.result);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(file.slice(0, n));
  });
}

// Pass an actual File object, e.g. the first file from an <input type="file">
readFirstBytes(files.item(0), 10).then((buffer) => {
  console.log(buffer);
});
Anuj Shah
  • How does this improve the read performance of the first N bytes? Did you test it with multiple file sizes? Why should this approach be any different from the ones described in my post? – kacase Apr 30 '22 at 21:49