
I need to read a very large (~30GB) .csv file and query it for specific values of interest. I tested the query code on a small dummy file and it worked, but I get a memory error when I try it on the actual large file. I think the strategy is not to read all the data at once but to process it in chunks; however, I have no coding experience, so I don't know how to do that.

Here's my code for reading in the very large file and then querying it:

import pandas as pd

# Loads the entire CSV into memory; this is what fails on the 30GB file.
synapses = pd.read_csv('c:/Users/anhdu/OneDrive/Desktop/Synapses FlyWire/flywire_buhman_wiring_v7.csv')
synapses_wanted = synapses.query('pre_pt_root_id == 720575940631147000 & post_pt_root_id == 720575940622342000')

I'm wondering if somebody could please help me with example code to do the above in chunks, so that my computer can handle it. The file has ~30 million rows. Many thanks!
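
For reference, here is a minimal sketch of the chunked approach, using the chunksize parameter of pandas' read_csv, which yields the file one piece at a time instead of loading it all at once. The chunk size of 1,000,000 rows is an assumption; tune it to your machine's available memory.

import pandas as pd

path = 'c:/Users/anhdu/OneDrive/Desktop/Synapses FlyWire/flywire_buhman_wiring_v7.csv'
matches = []

# read_csv with chunksize returns an iterator of DataFrames, so only one
# chunk of the file is held in memory at a time.
for chunk in pd.read_csv(path, chunksize=1_000_000):  # chunk size is a tunable assumption
    # Apply the same query to each chunk and keep only the matching rows.
    hits = chunk.query('pre_pt_root_id == 720575940631147000 & post_pt_root_id == 720575940622342000')
    if not hits.empty:
        matches.append(hits)

# Stitch the per-chunk matches back into a single DataFrame.
synapses_wanted = pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()

Since only the matching rows are kept from each chunk, peak memory use stays near one chunk's size rather than the full 30GB.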
