2

I need to process a pretty huge .csv file (at least 10 million rows, hundreds of columns) with Python. I'd like:

  1. To filter the content based on several criteria (mostly strings, maybe some regular expressions)
  2. To consolidate the filtered data. For instance, grouping them by date and, for each date, counting occurrences based on a specific criterion. Pretty similar to what a pivot table could do.
  3. I'd like to have user-friendly access to that consolidated data
  4. I'd like to generate charts (mostly basic line charts)
  5. Processing must be fast AND light, because computers at work cannot handle much and we're always in a hurry

Given these prerequisites, could you please suggest some ideas? I thought about using pandas. I also thought about dumping the CSV into a SQLite database (because it may be easier to query if I code a user interface). But this is really my first foray into this world, so I don't know where to start. I don't have much time, but I'd be very glad if you could offer some advice, some good (and fresh) things to read, interesting libs and so forth. Sorry if Stack Overflow is not the best place to ask for this kind of help; I'll delete the post if needed. Regards.
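
To make step 2 more concrete, here is roughly what I have in mind with pandas (just a sketch; the column names "date" and "category" are made up by me, and it naively reads the whole file into memory):

```python
import pandas as pd

# Sketch only: "date" and "category" are made-up column names
df = pd.read_csv("big_file.csv", parse_dates=["date"])

# 1. filter on a string criterion (str.contains also accepts regular expressions)
filtered = df[df["category"].str.contains("some pattern")]

# 2. consolidate: count occurrences per date, a bit like a pivot table
counts = filtered.groupby(filtered["date"].dt.date).size()

# 4. basic line chart
counts.plot(kind="line")
```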

Lo Bellin
  • 485
  • 4
  • 20
  • Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your post correspondingly. – MaxU - stand with Ukraine Oct 11 '17 at 20:43
  • 1
    **Don't read all data into a list and then process it.** Process your rows as you produce them. If you need to filter the data first, use a generator function – Chetan_Vasudevan Oct 11 '17 at 20:43
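
A minimal sketch of the generator-based approach the last comment describes (the file name and the "date"/"category" column names are hypothetical):

```python
import csv
from collections import Counter

def filtered_rows(path, pattern):
    """Yield matching rows one at a time instead of building a list."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if pattern in row["category"]:   # "category" is a made-up column
                yield row

# Consume the generator: count occurrences per date without holding the file in memory
counts = Counter(row["date"] for row in filtered_rows("big_file.csv", "some value"))
```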

2 Answers

1

There are 2 rather different situations:

  • when your reports (charts, pivot tables) use a limited number of columns from the original CSV, and you can pre-aggregate your large CSV file only once to get a much smaller dataset. This one-time processing can take some time (minutes), and there is no need to load the whole CSV into memory, as it can be processed as a data stream, row by row (see the first sketch below). After that you can use this small dataset for fast processing (filtering, grouping etc).
  • you don't know which columns of the original CSV may be used for grouping and filtering, and pre-aggregation is not possible. In other words, all 10M rows should be processed in real time (very fast) - this is an OLAP use case. This is possible if you load the CSV data into memory once and then iterate over the 10M rows quickly when needed; if this is not possible, the only option is to import it into a database. SQLite is a good lightweight DB and you can easily import a CSV with the sqlite3 command-line tool (see the second sketch below). Note that SQL queries over 10M rows might not be so fast, and you'll probably need to add some indexes.
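
A rough sketch of the first (pre-aggregation) scenario using pandas and its chunked reader, so the large CSV is never fully in memory ("date" and "category" are hypothetical column names):

```python
import pandas as pd

# Stream the large CSV in chunks and pre-aggregate counts per (date, category).
partial = []
for chunk in pd.read_csv("big_file.csv", usecols=["date", "category"],
                         parse_dates=["date"], chunksize=500_000):
    partial.append(chunk.groupby([chunk["date"].dt.date, "category"]).size())

# Combine the per-chunk counts into one small dataset and save it for fast reuse.
small = pd.concat(partial).groupby(level=[0, 1]).sum()
small.to_csv("aggregated.csv")
```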
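
For the SQLite route, besides the sqlite3 command-line tool, the import can also be done from Python; a sketch under the same assumptions (table and column names are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("data.db")

# Load the CSV into SQLite chunk by chunk to keep memory usage low.
for chunk in pd.read_csv("big_file.csv", chunksize=500_000):
    chunk.to_sql("mydata", con, if_exists="append", index=False)

# An index on the grouping/filtering column helps queries over 10M rows.
con.execute("CREATE INDEX IF NOT EXISTS idx_mydata_date ON mydata(date)")
con.commit()
con.close()
```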

Another option might be using a specialized OLAP database like Yandex ClickHouse - you can use it to query the CSV file directly with SQL (table engine=File) or import the CSV into its column store. This database is lightning fast with GROUP BY queries (it can process 10M rows in <1s).
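
If ClickHouse is chosen, a sketch of running such a GROUP BY from Python, assuming the third-party clickhouse-driver package, a local server, and a hypothetical table mydata already loaded from the CSV:

```python
from clickhouse_driver import Client  # third-party package, assumed installed

client = Client("localhost")

# Count occurrences per date; ClickHouse is very fast with this kind of GROUP BY.
result = client.execute(
    "SELECT date, count() AS n FROM mydata GROUP BY date ORDER BY date"
)
for date, n in result:
    print(date, n)
```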

Vitaliy Fedorchenko
  • 8,447
  • 3
  • 37
  • 34
  • On the contrary, even when the data fits in memory, it's a lot better to process the data as it's being read. It's actually *faster* than waiting for everything to get loaded in memory, *then* processing it. That's why databases are often *faster* than in-memory processing. BTW CSV importing programs don't load everything into the DB's memory either – Panagiotis Kanavos Oct 16 '17 at 09:55
  • @PanagiotisKanavos this depends on the usage scenario. Imagine that you need to execute many random OLAP queries against the large CSV - processing it as a stream to handle each query could be time consuming (several minutes for a 5-6 GB CSV). Of course for one-time aggregations there is no need to load the whole CSV into RAM, as stream processing will take approximately the same time - in both cases most of the time is taken by file reads, CSV parsing and value parsing. – Vitaliy Fedorchenko Oct 17 '17 at 06:42
  • I don't have to imagine. I do handle and import large CSVs. Never reading everything in memory. I create a pipeline and process the stream of rows. Memory usage drops from multi-GB to 40 MB, speed goes up because I *don't* have to wait for the disk to load everything before processing. OLAP is an inappropriate example - it's what you get *after* you import and process all files. – Panagiotis Kanavos Oct 17 '17 at 07:01
  • Are you confusing ETL with OLAP perhaps? – Panagiotis Kanavos Oct 17 '17 at 07:03
  • PS - You used the *ndex word. No one should use that *ndex word. Not when their "CSVs" contain PGP HEADERS AND FOOTERS. Try importing *that*. – Panagiotis Kanavos Oct 17 '17 at 07:04
  • @PanagiotisKanavos if you don't know the grouping criteria (which might be specified by the user), how can you pre-aggregate a multi-GB file to 40MB? It is good that this is possible in your usage scenario (= you can reduce the number of columns available for the reports compared to the original dataset), but in many cases this is not an option. – Vitaliy Fedorchenko Oct 17 '17 at 14:00
  • The question is about *loading* not preaggregating. It's about ETL, not OLAP – Panagiotis Kanavos Oct 17 '17 at 14:34
  • That's not clear from the question... especially item (2) "Pretty similar to what a pivot table could do" -- this means OLAP-style queries, doesn't it? If the author clarifies that the question is about ETL, I can delete my answer, as it is about the OLAP usage scenario. – Vitaliy Fedorchenko Oct 17 '17 at 14:44
  • No. It means that you are talking about the *end* result. The OP is still at the starting line. OLAP is the last 10% of any DW project. – Panagiotis Kanavos Oct 17 '17 at 14:46
  • @Panagiotis Kanavos Please tell us what tools you use to handle and import large CSV files without ever reading everything into memory. – San Jun 27 '21 at 11:39
1

Give xsv a shot. It is quite convenient, with decent speed, and it fits the Unix philosophy. However, if the dataset will be used more than ten times, I'd suggest converting the CSV to some binary format, and ClickHouse is a good choice for that.

Amos
  • 3,238
  • 4
  • 19
  • 41