10

I'm using Atom/Electron for building an app that has visualisations over video based on data. Each video has a corresponding CSV file with information for each frame. The videos are about 100 minutes long, so the files have plenty of data!

The problem I'm having is that it takes a couple of seconds to load and parse the file. Most of the time this is not a problem. But I need to make playlists of parts of videos, and loading the whole CSV file each time a video is changed is not a viable option.

I've been looking at file-streaming options such as fast-csv, but I didn't manage to start reading from an arbitrary part of the file.

EDIT: From the fs documentation (quoted below). In this case, the question is: how can I know which byte corresponds to the position I want in the file?

options can include start and end values to read a range of bytes from the file instead of the entire file. Both start and end are inclusive and start at 0.
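For illustration, this is roughly how those options are used (the file name and byte offsets here are made up; the missing piece is still how to compute the offsets for a given row):

```js
const fs = require('fs');

// Read an inclusive byte range from the file; these offsets are just an example.
const stream = fs.createReadStream('video01.csv', {
  start: 1024,
  end: 2047,
  encoding: 'utf8',
});

let text = '';
stream.on('data', (chunk) => { text += chunk; });
stream.on('end', () => {
  // The range will almost certainly cut rows mid-line, so the partial
  // first and last lines have to be discarded or stitched together.
  console.log(text.split('\n'));
});
```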

What do you think would be the best and most performant approach to this situation?

Concretely:

Is there a way of starting to read a stream from any part of a CSV file?

Do you think there is another storage method that would allow me to solve this problem better?

UPDATE:

In the end, I solved this by storing the data in a file in binary format. Since I know how many columns the file has, I can just read straight from the segment of the file I need without any performance implications.
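Roughly, the idea looks like this (the column count, file name, and the use of 64-bit floats are illustrative, not the exact code): with fixed-size records, a frame's data sits at a known byte offset, so you can seek straight to it.

```js
const fs = require('fs');

const COLUMNS = 10;           // assumed column count
const ROW_SIZE = COLUMNS * 8; // one Float64 per column

function readFrame(fd, frameIndex) {
  const buffer = Buffer.alloc(ROW_SIZE);
  // Read exactly one record, starting at frameIndex * ROW_SIZE.
  fs.readSync(fd, buffer, 0, ROW_SIZE, frameIndex * ROW_SIZE);
  const row = [];
  for (let i = 0; i < COLUMNS; i++) {
    row.push(buffer.readDoubleLE(i * 8));
  }
  return row;
}

const fd = fs.openSync('frames.bin', 'r');
console.log(readFrame(fd, 1500)); // data for frame 1500
```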

limoragni
    Are the column widths fixed in the csv? (e.g. column 1 is always 20 characters) – Devin H. Jun 22 '15 at 19:18
  • This question may be of use: http://stackoverflow.com/questions/6156501/read-a-file-one-line-at-a-time-in-node-js I don't know enough about node to answer for sure, but it seems to me you'd want to send one line of the csv at a time. – Devin H. Jun 22 '15 at 19:22
  • @DevinH. no, but each column is always a number, so it wouldn't be difficult to use some kind of padding. Ideally I would prefer not to do this. I was thinking of indexing the byte position of each row and storing an array with the positions. It wouldn't be a problem to do some processing when the file is loaded for the first time. – limoragni Jun 22 '15 at 19:23
  • Specify a buffer chunk size on each read? That way you only read small chunks at a time (I'm not sure, but that looks like my first approach). – Bwaxxlo Jun 22 '15 at 22:31
  • @Bwaxxio I think you've misunderstood the question. He seems to be trying to random-access small parts of the file, not stream the whole thing. – David Knipe Jun 22 '15 at 22:58
  • Does one of the CSV columns tell you how far through you are? If so then you could estimate it, read that part of the file, then try again until you get it right. Maybe combine this with your idea - just index a few of the rows and estimate from those. (Indexing every row could be quite memory-intensive.) – David Knipe Jun 22 '15 at 23:03
  • Right now, which video frame corresponds to which row is given by the row number, but the row number is not one of the columns. Maybe I can add that information to each row, and given the csv length I could estimate where to start streaming, then create a buffer and try to read from the buffer. @Bwaxxlo I want to be able to seek in the video and go to the corresponding frame data in the csv without having to read the whole file first. – limoragni Jun 23 '15 at 14:25
  • @limoragni how big is that file? I believe you want something that isn't needed. Otherwise you have 3 options: 1) create an index of each column; 2) use a static column width; 3) start using a normal database, e.g. sqlite, which will be enough in your case. – Dmitry Polushkin Jul 18 '15 at 22:25
  • Do you know what will be the next video in your sequence? – Dmitry Polushkin Jul 18 '15 at 22:26
  • Why do you have a separate CSV file for each video file? Why not bundle them into one csv file? – Dmitry Polushkin Jul 18 '15 at 22:27
  • @DmitryPolushkin I don't always know what the next video is. It could even be a fragment of a video. The user makes playlists with fragments. And the data is created on a per-video basis; each week we have to send the data and the video to the users of the software. Anyway, the way I see it, one big csv would be even more problematic. I could use a database; the one I tried so far was nedb, but it takes too much time loading the database into memory (that is what nedb does). Does sqlite read directly from disk? I'll look into it! – limoragni Jul 19 '15 at 00:06
  • @limoragni - It sounds like Sqlite is exactly what you need. It may not be the ultimate solution but it will definitely get you going a lot sooner. It is extremely optimized and can even be used to run large websites (although that is discouraged... but fun). Side note: maybe dissect Sqlite a little bit, you may find its inner workings useful to your own solutions. – Blizzardengle Jul 21 '15 at 17:45
  • @Blizzardengle Do you think sqlite can handle querying 25 rows per second? And would it do this reading from disk? Or do I have to load the whole database into memory at some point? Maybe you can put this in an answer, it seems like the most suitable solution to this problem if it works like that! – limoragni Jul 24 '15 at 00:10
  • Rather than trying to work around the data to meet the requirements, I would consider whether or not you can modify the data to meet the demand. [Here's a python script to convert your CSV to sqlite.](https://gist.github.com/rgrp/5199059#file-csv2sqlite-py) – davidcondrey Jul 24 '15 at 07:18

2 Answers

3

I would highly recommend Papaparse for this. It allows streaming a CSV 'row-by-row', and each row can be processed as JSON based on the headers in the file.

Within a config object passed to the parsing function, you can give a 'step' parameter, which is a function to be carried out for each row of the file as it steps through.

Note: it can also be configured to use a worker thread for increased performance when handling very large CSVs.

http://papaparse.com/docs
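A minimal sketch of that step-based approach (the file name, the `frame` column, and the exact shape of `results.data` are assumptions on my part; see the docs above):

```js
const fs = require('fs');
const Papa = require('papaparse');

Papa.parse(fs.createReadStream('frames.csv'), {
  header: true,        // rows come back keyed by the header names
  dynamicTyping: true, // convert numeric fields to numbers
  step: (results) => {
    const row = results.data; // one parsed row per call (shape varies slightly between versions)
    // Only do work for the rows you actually need; skipping is cheap.
    if (row.frame >= 15000 && row.frame < 15100) {
      // draw the overlay for this frame ...
    }
  },
  complete: () => console.log('finished'),
});
```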

locksem
  • Ok, it seems cool, but I think it's no different from something like fast-csv or csv-parser. It doesn't give me a way to start streaming from an arbitrary row of the csv. If I always have to start streaming from the beginning, the solution doesn't fit my case. Do you know if there is a way of jumping to a given row? – limoragni Jul 24 '15 at 00:06
  • If, for instance, you required information that resided in row 500 of the file, your 'step' function would simply catch this, and wouldn't do any processing for the other rows. Technically, in order to begin streaming from a particular point in the file, the beginning of the file would still need to be loaded in some capacity to know you've reached your required point. If no processing occurs on a particular 'step', it moves through the file incredibly quickly - so I wouldn't be concerned about performance issues here. – locksem Jul 31 '15 at 15:55
  • I had errors using fast-csv to parse a large csv in nodejs, no errors in this library though (with the same csv file) https://github.com/wdavidw/node-csv – steampowered Dec 17 '15 at 00:23
1

Per my comment, Sqlite seems to be what you're looking for. It may not be your permanent solution in the long run, but it will certainly work for the time being while you decide whether you want to stick with it or code your own solution.

Inner Workings of Sqlite

Sqlite is optimized to the core, but it has three major features that cause it to execute faster than normal disk reads, especially of CSV files:

  1. The entire database (every database you create) is stored in a single file, not multiple files or records.
  2. This file is paged into 1024-byte (1K) chunks, allowing you to jump easily around the data.
  3. (Really part of 2) The entire database and paging system is one massive binary tree that usually takes under 10 jumps to find any given data. So in layman's terms, extremely fast!

If you are really interested in understanding the full scope of all of this, I have found no better explanation than this amazing blog post by Julia Evans.

Possible Drawbacks

Besides the inner workings, Sqlite is designed to be client-side, working on the user's machine. If this is not a viable solution, there are workarounds. Sqlite can be used behind a web server, for example, but it thrives best in a standalone or mixed installation. Also remember every client's computer is different. One computer may process records faster than the next, but in general you do not need to worry, since client-side computers are usually under little load.

  • Standalone would require everything to be on the client's side. This is typically how Sqlite is used. I have used it for games in the past, utilizing sqlite4java's API to connect to the database with Java; the API made the whole experience feel like PHP and MySQL on a server. You may need to find other APIs since Sqlite is written in C.
  • Mixed installation is done the same way as standalone, but you code into your program a link to an actual server. For the games I helped make we would track things like scores and user data, and then periodically in the background pass this to an actual server if we could get a connection. This also works in reverse: you can start the user with nothing, but on first run it can download everything you need and from then on keep itself up-to-date with what is on the server.

Summary

Sqlite will work for what you need, but may require a little homework to set up in the fashion you need. Sqlite4java, for example, is easy to install but confusing to learn since its documentation is so poor; Stack Overflow got me through it, though. Sqlite is also a use-it-and-forget-it type of installation, so to answer your question: it will handle 25 rows a second like cake. You do not need to worry about optimizing it, only your own code.
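As a rough sketch of how this could look from Electron/Node (the `sqlite3` package, the `frames` table, and its `frame` column are my assumptions, not part of the answer above), fetching one frame's data by number is a single indexed lookup:

```js
const sqlite3 = require('sqlite3');

// Assumes the CSV was imported into a `frames` table (e.g. via the
// csv-to-sqlite script linked in the comments) and that `frame` is the
// primary key or indexed, so each lookup is a quick B-tree seek.
const db = new sqlite3.Database('frames.db', sqlite3.OPEN_READONLY);

function getFrame(frameNumber, callback) {
  db.get('SELECT * FROM frames WHERE frame = ?', [frameNumber], callback);
}

// 25 of these per second is not a problem; sqlite reads pages from disk
// (plus its own page cache), so the whole database never has to be loaded into memory.
getFrame(15000, (err, row) => {
  if (err) throw err;
  console.log(row);
});
```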

Blizzardengle