I'm developing a CSV parser that should be able to handle huge datasets (read: 10 million rows) in the browser.

Basically the parser works as follows:

  1. The main thread reads a 20MB chunk (reading the whole file at once would quickly crash the browser) and then sends the chunk to one of the workers (see the sketch after this list).

  2. The worker receives the data, discards the columns I don't want and keeps the ones I want. Normally I only want 4-5 out of 20-30 columns.

  3. The worker sends the processed data back to the main thread.

  4. The main thread receives the data and saves it in the data array.

  5. Repeat steps 1-4 until the file is done.
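
A minimal sketch of steps 1-5, assuming the file comes from an <input type="file"> and using a single worker for brevity (names like CHUNK_SIZE, readNextChunk and the column indices are illustrative, not my actual code):

// main.js - steps 1, 4 and 5
var CHUNK_SIZE = 20 * 1024 * 1024; // read 20MB at a time
var data = [];                     // one entry per processed chunk
var worker = new Worker('worker.js');
var offset = 0;
var file;                          // File object, set from an <input type="file">, then call readNextChunk()

worker.onmessage = function (e) {
    data.push(e.data);                        // step 4: store the extracted rows
    offset += CHUNK_SIZE;
    if (offset < file.size) readNextChunk();  // step 5: repeat until done
    else worker.terminate();
};

function readNextChunk() {
    var reader = new FileReader();
    reader.onload = function (e) {
        worker.postMessage(e.target.result);  // step 1: hand the chunk to the worker
    };
    // NOTE: a fixed-size slice can cut a row in half; real code must carry
    // the partial last line over to the next chunk.
    reader.readAsText(file.slice(offset, offset + CHUNK_SIZE));
}

// worker.js - steps 2 and 3
var WANTED = [19, 20, 17, 3, 4]; // indices of the kept columns (illustrative)

onmessage = function (e) {
    var rows = e.data.split('\n').map(function (line) {
        var cols = line.split(','); // naive split; quoted fields need a real CSV parser
        return WANTED.map(function (i) { return cols[i]; });
    });
    postMessage(rows);              // step 3: send the processed rows back
};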

At the end, with the dataset (Crimes - City of Chicago), I end up with an array that contains 71 other arrays, each of which holds +/- 90K elements. Each of these 90K elements is an array of 5 strings (the columns taken from the file): latitude, longitude, year, block and IUCR.

Summarizing: 71 is the number of 20MB chunks in the dataset, ~90K is the number of rows per chunk, and 5 is the number of columns extracted (roughly 71 × 90,000 × 5 ≈ 32 million strings in total).

I noticed that the browser (Chrome) was using too much memory, so I tried the same thing in 4 different browsers (Chrome, Opera, Vivaldi and Firefox) and recorded the memory used by the tab:

  1. Chrome - 1.76GB
  2. Opera - 1.76GB
  3. Firefox - 1.3GB
  4. Vivaldi - 1GB

If I try to recreate the same array but with mock data, it only uses approx. 350MB of memory:

var data = [];
for (let i = 0; i < 71; i++) {
    let rows = [];
    for (let j = 0; j < 90 * 1000; j++) {
        // every row references the same five string literals
        rows.push(["029XX W MADISON ST", "2027", "-87.698850575", "2001", "41.880939487"]);
    }
    data.push(rows);
}

I understand that the static case in the code above is easier for the engine to optimize than the dynamic one, but I wasn't expecting it to use 5 times more memory for the same quantity of data.
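
For comparison, a variant of the mock that builds distinct strings per row, which is closer to what the parser actually produces, makes the gap visible (a rough sketch; the values are made up and the exact figures vary by engine):

var data = [];
for (let i = 0; i < 71; i++) {
    let rows = [];
    for (let j = 0; j < 90 * 1000; j++) {
        // dynamically built strings are distinct objects, unlike the single
        // shared literals in the mock above
        rows.push([
            (j % 999) + "XX W MADISON ST",
            String(1000 + (j % 9000)),
            String(-87.6 - j / 1e7),
            String(2001 + (i % 17)),
            String(41.8 + j / 1e7)
        ]);
    }
    data.push(rows);
}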

Is there anything I can do to make the parser use less memory?

Rui Alves
  • I really don't think you should be trying to do this in the browser. – torazaburo May 21 '17 at 17:09
  • You could run this on a server instead (maybe Node.js); it doesn't rely on varying browser environments/memory and has a good implementation called streams... – Jonas Wilms May 21 '17 at 17:11
  • @torazaburo You're probably right. Can you give me reasons why I shouldn't do it? – Rui Alves May 21 '17 at 17:14
  • @Jonasw The problem is that I'm developing a client-side API to create thematic maps, so it's important to do it client-side. But I know there are limits on the client side, and I want to test those limits so I can say that beyond, say, 2 million rows it shouldn't be done client-side only. – Rui Alves May 21 '17 at 17:18
  • I would store it server-side, then pass queries etc. to the server and return small chunked results... However, I can't tell you why the memory consumption is that high... – Jonas Wilms May 21 '17 at 17:21
  • Is the approx. 1.5GB of memory for the whole process or only for step 4? – Pablo Cesar Cordova Morales May 21 '17 at 17:22
  • By the way, do your workers recycle their memory properly? That might be the problem... – Jonas Wilms May 21 '17 at 17:22
  • @PabloCesarCordovaMorales The memory is approx. 1.5GB at the end, when everything is done and I have the data extracted from the file. – Rui Alves May 21 '17 at 17:23
  • @Jonasw I think so. See [worker.js gist](https://gist.github.com/iursevla/0d4c5e38f621aea3fa6d28f184875945). The main thread even sends terminate() to all workers when the job is done. – Rui Alves May 21 '17 at 17:27
  • You say that you get 4-5 columns, so the arrays you push vary between 4 and 5 elements. I know that the V8 engine is optimized for objects of the same shape, so maybe you can force every row to have exactly 5 columns to optimize your process: https://www.youtube.com/watch?v=p-iiEDtpy6I&t=1241s – Pablo Cesar Cordova Morales May 21 '17 at 17:44
  • @PabloCesarCordovaMorales I tried to create an object with 5 fields, but it didn't improve much. Was that your idea, or did I misunderstand? – Rui Alves May 21 '17 at 19:40
  • Your results are rather weird given that Vivaldi, Opera and Chrome use the same engine. Maybe different versions? Or the Vivaldi team just tuned the settings towards a low memory footprint. – Bergi May 23 '17 at 09:57
  • I guess your static array takes much less memory because [the strings are interned](https://stackoverflow.com/q/5276915/1048572), so all those strings are references sharing the same data in memory, unlike the many different strings in your actual data. – Bergi May 23 '17 at 09:59
  • @Bergi You're correct. I already made some improvements to my code, and now it uses much less memory. – Rui Alves May 23 '17 at 10:55

1 Answer

Basically, to use less memory, one can apply a few techniques.

First, CSV columns that contain numbers should be converted to numbers and used as such, since a number in JavaScript takes 8 bytes while the same number stored as a string can take much more space (2 bytes per character).
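
For instance, the worker could convert the numeric columns while extracting them (a minimal sketch; the column indices are illustrative):

// cols is one row of the CSV already split into fields
function extractRow(cols) {
    return [
        cols[3],          // block - genuinely text, keep as a string
        cols[4],          // IUCR  - codes may have leading zeros, keep as a string
        Number(cols[17]), // year      - 8 bytes instead of a 4-char string
        Number(cols[19]), // latitude  - 8 bytes instead of a ~12-char string
        Number(cols[20])  // longitude
    ];
}

Going further, the numeric columns could be stored per chunk in typed arrays (e.g. a Float64Array), whose underlying ArrayBuffers can also be transferred from the worker to the main thread without copying.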

Another thing is to terminate all the workers when the job is done, so their memory can be reclaimed.
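
For example, assuming the main thread keeps its workers in an array (a minimal sketch):

var workers = []; // filled with Worker instances while parsing

function finish() {
    workers.forEach(function (w) { w.terminate(); }); // stop each worker
    workers.length = 0; // drop the references so the GC can reclaim their memory
}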

Rui Alves