
I'd like to reverse the following steps, performed on the client in JavaScript, but am having trouble with the blob.

In an IndexedDB database, over an open cursor on an object store index:

  1. Extracted data object from database.
  2. Converted object to string with JSON.stringify.
  3. Made a new blob ({ type: 'text/csv' }) from the JSON string.
  4. Wrote blob to an array.
  5. Moved the cursor forward one record and repeated from step 1.

After the transaction completed successfully, a new blob of the same type was made from the array of blobs.
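For context, a stripped-down sketch of that export loop looks roughly like the following (the store and index names are placeholders, and error handling is omitted):

   function export_store( db )
     {
       const blobs = [];
       const T = db.transaction( [ 'qst' ], 'readonly' );                 // 'qst' is a placeholder store name
       const req = T.objectStore( 'qst' ).index( 'by_key' ).openCursor(); // 'by_key' is a placeholder index name

       req.onsuccess = () =>
         {
           const cursor = req.result;
           if ( cursor )
             {
               const s = JSON.stringify( cursor.value );               // step 2
               blobs.push( new Blob( [ s ], { type: 'text/csv' } ) );  // steps 3 and 4
               cursor.continue();                                      // step 5
             };
         };

       T.oncomplete = () =>
         {
           // After the transaction completes, merge the array of blobs into one large blob.
           const merged = new Blob( blobs, { type: 'text/csv' } );
           // merged can then be downloaded, for example via URL.createObjectURL( merged ).
         };
     }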

The reason for doing it this way is that the concatenation of the JSON strings would exceed the maximum permitted size for a single string, so I couldn't concatenate first and make one blob from that large string. However, the array of blobs could be made into a single blob of greater size, approximately 350 MB, and downloaded to the client disk.

To reverse this process, I thought I could read the blob back in, slice it into the component blobs, and then read each of those blobs as a string; but I can't figure out how to do it.

If the file is read as text with FileReader, the result is one large block of text that cannot be assigned to a single variable, because it exceeds the maximum string size and throws an allocation size overflow error.

Reading the file as an array buffer appeared to be an approach that would allow slicing the blob into pieces, but there seems to be an encoding issue of some kind.

Is there a way to reverse the original process as is, or an encoding step that can be added that will allow the array buffer to be converted back to the original strings?
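For example, I'm wondering whether something like TextDecoder is the kind of encoding step that's needed. Here is a rough, untested sketch of what I mean; it assumes 'start' and 'length' mark one complete JSON string and fall on character boundaries:

    // Rough sketch only: read one piece of the file as bytes and decode it back to a string.
    // 'start' and 'length' are assumed to mark the boundaries of one complete JSON string.
    const reader = new FileReader();
    reader.onload = () =>
      {
        const bytes = new Uint8Array( reader.result );             // view over the array buffer
        const text = new TextDecoder( 'utf-8' ).decode( bytes );   // bytes back to a string
        const obj = JSON.parse( text );                            // original data object
      };
    reader.readAsArrayBuffer( f.slice( start, start + length ) );  // f is the blob/file from disk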

I read over some questions that appeared to be related, but at this point I don't understand the encoding issues they discuss. It seems rather complicated to recover a string.

Thank you for any guidance you can provide.

Additional information after employing the accepted answer

There's certainly nothing special about my code posted below, but I figured I'd share it for those who may be as new to this as I am. It is the accepted answer integrated into the async function I use to read the blobs, parse them, and write them to the database.

This method uses very little memory. It is too bad there isn't a way to do the same for writing the data to disk. In writing the database to disk, memory usage increases as the large blob is generated, and is then released shortly after the download completes. Using this method to upload the file from the local disk appears to work without ever loading the entire blob into memory before slicing. It is as if the file is read from the disk in slices. So, it is very efficient in terms of memory usage.

In my specific case, there is still work to be done, because using this to write the 50,000 JSON strings totalling 350 MB back to the database is rather slow and takes about 7:30 (min:sec) to complete.

Right now each individual string is separately sliced, read as text, and written to the database in a single transaction. Whether slicing the blob into larger pieces that each contain a set of JSON strings, reading each piece as text in one block, and then writing them to the database in a single transaction will perform more quickly while still not using a large amount of memory is something I will need to experiment with, and a topic for a separate question.

If I use the alternative loop, which determines the number of JSON strings needed to fill the size const c, slices a blob of that size, reads it as text, and splits it up to parse each individual JSON string, the time to complete is about 1:30 for c = 250,000 through 1,000,000. It appears that parsing a large number of JSON strings still slows things down regardless. Large blob slices don't translate to large amounts of text being parsed as a single block; each of the 50,000 strings still needs to be parsed individually.

   try

     {

       let i, l, b, result, map, p;

       const c = 1000000;


       // First get the file map from front of blob/file.

       // Read first ten characters to get length of map JSON string.

       b = new Blob( [ f.slice(0,10) ], { type: 'text/csv' } ); 

       result = await read_file( b );

       l = parseInt(result.value);


       // Read the map string and parse to array of objects.

       b = new Blob( [ f.slice( 10, 10 + l) ], { type: 'text/csv' } ); 

       result = await read_file( b );

       map = JSON.parse(result.value); 


       l = map.length;

       p = 10 + result.value.length;


       // Using this loop takes about 7:30 to complete.

       for ( i = 1; i < l; i++ )

         {

           b = new Blob( [ f.slice( p, p + map[i].l ) ], { type: 'text/csv' } ); 

           result = await read_file( b ); // FileReader wrapped in a promise.

           result = await write_qst( JSON.parse( result.value ) ); // Database transaction wrapped in a promise.

           p = p + map[i].l;

           $("#msg").text( result );

         }; // next i


       $("#msg").text( "Successfully wrote all data to the database." );


       i = l = b = result = map = p = null;

     }

   catch(e)

     { 

       alert( "error " + e );

     }

   finally

     {

       f = null;

     }
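For completeness, read_file and write_qst referenced above are just a FileReader and a single database put wrapped in promises. Simplified versions look roughly like this (the 'qst' store name, the open connection db, and the shape of the resolved values are specific to my code):

   function read_file( b )
     {
       return new Promise( ( resolve, reject ) =>
         {
           const reader = new FileReader();
           reader.onload = () => resolve( { value : reader.result } ); // read back as result.value above
           reader.onerror = () => reject( reader.error );
           reader.readAsText( b );
         } );
     }

   function write_qst( obj )
     {
       return new Promise( ( resolve, reject ) =>
         {
           const T = db.transaction( [ 'qst' ], 'readwrite' ); // db is the open database connection
           T.objectStore( 'qst' ).put( obj );
           T.oncomplete = () => resolve( 'Wrote one object to the database.' );
           T.onerror = () => reject( T.error );
         } );
     }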



/* 

  // Alternative loop that completes in about 1:30 versus 7:30 for the above loop.


       for ( i = 1; i < l; i++ )

         { 

           let status = false, 

               k, j, n = 0, x = 0, 

               L = map[i].l,

               a_parse = [];



           if ( L < c ) status = true;

           while ( status )

             {

               if ( i+1 < l && L + map[i+1].l <= c ) 

                 {

                   L = L + map[i+1].l;

                   i = i + 1;

                   n = n + 1;

                 }

               else

                 {

                   status = false;

                 };

             }; // loop while


           b = new Blob( [ f.slice( p, p + L ) ], { type: 'text/csv' } ); 

           result = await read_file( b ); 

           j = i - n; 

           for ( k = j; k <= i; k++ )

             {

                a_parse.push( JSON.parse( result.value.substring( x, x + map[k].l ) ) );

                x = x + map[k].l;

             }; // next k

           result = await write_qst_grp( a_parse, i + ' of ' + l );

           p = p + L;

           $("#msg").text( result );

         }; // next i



*/
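The write_qst_grp used in the alternative loop above is the same idea as write_qst, except that it puts the whole group of parsed objects in one readwrite transaction instead of one transaction per object. Roughly (again a simplified sketch with my placeholder names):

   function write_qst_grp( a_parse, msg )
     {
       return new Promise( ( resolve, reject ) =>
         {
           const T = db.transaction( [ 'qst' ], 'readwrite' );  // db is the open database connection
           const store = T.objectStore( 'qst' );
           for ( const obj of a_parse ) store.put( obj );       // one transaction for the whole group
           T.oncomplete = () => resolve( 'Wrote group ' + msg + ' to the database.' );
           T.onerror = () => reject( T.error );
         } );
     }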



/*

// I was using this loop when I thought the concern might be that the JSON strings were too large,
// but then realized the issue in my case is the opposite: having 50,000 JSON strings of smaller size.

       for ( i = 1; i < l; i++ )

         {

           let x,

               m = map[i].l,

               str = [];

           while ( m > 0 )

             {

               x = Math.min( m, c );

               m = m - c;

               b = new Blob( [ f.slice( p, p + x ) ], { type: 'text/csv' } ); 

               result = await read_file( b );

               str.push( result.value );

               p = p + x;

             }; // loop while


            result = await write_qst( JSON.parse( str.join("") ) );

            $("#msg").text( result );

            str = null;

         }; // next i
*/           
Gary

1 Answer


Funnily enough, you already said in your question what should be done:

Slice your Blob.

The Blob interface does have a .slice() method.
But to use it, you should keep track of the positions where your merging occurred (this could be in another field of your db, or even in a header of your file):

function readChunks({blob, chunk_size}) {
  console.log('full Blob size', blob.size);
  const strings = [];  
  const reader = new FileReader();
  var cursor = 0;
  reader.onload = onsingleprocessed;
  
  readNext();
  
  function readNext() {
    // here is the magic
    const nextChunk = blob.slice(cursor, (cursor + chunk_size));
    cursor += chunk_size;
    reader.readAsText(nextChunk);
  }
  function onsingleprocessed() {
    strings.push(reader.result);
    if(cursor < blob.size) readNext();
    else {
      console.log('read %s chunks', strings.length);
      console.log('excerpt content of the first chunk',
        strings[0].substring(0, 30));
    }
  }
}



// we will do the demo in a Worker so we don't freeze the visitor's page
function worker_script() {
  self.onmessage = e => {
    const blobs = [];
    const chunk_size = 1024*1024; // 1MB per chunk
    for(let i=0; i<500; i++) {
      let arr = new Uint8Array(chunk_size);
      arr.fill(97); // only 'a'
      blobs.push(new Blob([arr], {type:'text/plain'}));
    }
    const merged = new Blob(blobs, {type: 'text/plain'});
    self.postMessage({blob: merged, chunk_size: chunk_size});
  }
}
const worker_url = URL.createObjectURL(
  new Blob([`(${worker_script.toString()})()`],
    {type: 'application/javascript'}
  )
);
const worker = new Worker(worker_url);
worker.onmessage = e => readChunks(e.data);
worker.postMessage('do it');
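As a side note on the header idea mentioned above, here is an illustrative sketch (not part of the snippet, and only one possible format) of building such a header when merging, roughly mirroring the length-prefix-plus-map layout that the code in the question reads back:

// Illustrative sketch only: prepend a fixed 10-character length field, then a JSON
// map of string lengths, then the JSON strings themselves.
// Note: .length counts characters, which equals the byte length only for ASCII
// content; for non-ASCII data the byte lengths would have to be recorded instead.
function mergeWithHeader(jsonStrings) {
  const map = jsonStrings.map(s => ({l: s.length}));
  const mapString = JSON.stringify(map);
  const lengthField = String(mapString.length).padStart(10, ' ');
  return new Blob([lengthField, mapString, ...jsonStrings], {type: 'text/csv'});
}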
Kaiido
  • Thank you for the information on how to slice the blob and read it in chunks. I have a data map array at the front of the file that will assist in where to slice. However, the item I still am not understanding is how to read the blob from disk before slicing. It can't just be read as text and assigned to a variable because it throws an allocation size overflow error. I tried placing reader.result in a new Blob statement and the same error occurs. Should it be read as an array buffer and viewed as a Uint8Array? If so, then that would already do the "slicing", yes? – Gary Jun 22 '18 at 19:04
  • The point is to not read it before slicing. You read only the sliced parts. – Kaiido Jun 23 '18 at 01:41
  • You build the blob named 'merged' in the browser, slice it, and read each slice using FileReader. In my case, the large file is on the client disk and first needs to be read into the browser. Since it exceeds the maximum allowed size for a single string, using FileReader to read the file as text throws an allocation size overflow error. The only way I found, so far, of reading in the file is as an array buffer. I'm working on writing slices of reader.result to a new Uint8Array() and using fromCharCode to convert that back to a string. If I could read the blob in as text, I'd use your method. – Gary Jun 23 '18 at 02:52
  • @Gary But the Blob you get from user disk is exactly like my *merged* Blob, I even made it come from a Worker to show how the creation is not linked to the reading part. In the reading part, the blob is the full one, just like yours. But, we read it only by chunks. We never read the whole Blob in its entirety. Blob.slice returns a new Blob, representing only the chunk that has been sliced and this is what we read. – Kaiido Jun 23 '18 at 03:45
  • So for example if we have a merged Blob made from ["abcd"], merged.slice(0,1) will be a new Blob made of ["a"] – Kaiido Jun 23 '18 at 03:48
  • Thanks. I understand the concept of slicing the blob and then reading each slice as text. That is, once the blob is in the browser. The blob can't be read from disk into the browser in slices, can it? Doesn't the entire blob have to be read from disk into the browser first and then sliced? The const 'merged' in your example is the entire blob. I can't get the blob as text from disk into a variable like 'merged' in order to slice it, because the file is 350MB and the max is somewhere around 256MB for a text string. The only way I can get the blob in from disk is readAsArrayBuffer. – Gary Jun 23 '18 at 04:37
  • The blob binary data stays in memory, but is not tied to the string max length. The only things that are subject to this limit are the strings, and here the strings are only generated from the slices. So just like in this example, your file is fully in the browser's memory, but since we generate strings only from chunks of it, we don't face the string max length limit. – Kaiido Jun 23 '18 at 06:09
  • I apologize. It finally hit me. It's about 3am here. I've been so stupid. For some reason I kept thinking that I had to use FileReader to get the file from the disk, even though I've been passing the blob to my FileReader promise all along. Too many things in my small brain at one time I guess. I learned a good bit about array buffers today though. I understand now and what you gave me works great. I accepted it as the answer also. Thanks for not giving up on me. – Gary Jun 23 '18 at 06:57