
I'm writing a program that runs algorithms on a very large (~6GB) CSV file, which is loaded with pandas using read_csv(). The issue is that every time I tweak my algorithms and need to re-simulate (which is very often), I have to wait ~30s for the dataset to load into memory, and then another ~30s to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded, however, operations run very quickly.

So far I've tried using mmap and loading the dataset onto a RAM disk, with no improvement.

I'm hoping to find a way to load the dataset into memory once with one process, and then access it in memory from the algorithm-crunching process, which gets re-run each time I make a change.

This thread seems close-ish to what I need, but it uses multiprocessing, which requires everything to run within the same context.

I'm not a computer engineer (I'm electrical :), so I'm not sure whether what I'm asking for is even possible. Any help would be appreciated, however.

Thanks,

clambot
  • What is the speed of your hard disk? At 200 MB/s, it'll already take 30 seconds to load the file. How much RAM do you have? Why not let the algorithm and visualization run in the same process? – Thomas Weller Apr 16 '22 at 17:05
  • Are you updating then re-running the script each time? Aren't you using something interactive like IPython or Jupyter? – Thierry Lathuille Apr 16 '22 at 17:13
  • @ThomasWeller The hard disk is a WD Black SN750SE 1TB; it seems to do 3600MB/s. I have 64GB of DDR4 RAM. The reason for the separate graphing is that I can load it with different parameters, and I often need to load several graphs at a time. I'm thinking the pandas parsing may be my limiting factor here. – clambot Apr 16 '22 at 17:15
  • @ThierryLathuille Yes, I am updating and then re-running. The program is >10,000 lines long. Is this something I can even run in Jupyter? – clambot Apr 16 '22 at 17:19
  • IMHO, CSVs are slow to parse and very inefficient in terms of space. If you are using the same dataset over and over, I would try to re-cast the data as something more efficient (nearer to binary) and work off that... – Mark Setchell Apr 16 '22 at 17:46
  • For example `123456.789` plus its comma in a CSV is 12 bytes, but as an IEEE754 float it is just 4 bytes. – Mark Setchell Apr 16 '22 at 19:22
  • WOW. @MarkSetchell, I looked into this a bit and stumbled on this helpful article: https://towardsdatascience.com/stop-persisting-pandas-data-frames-in-csvs-f369a6440af5 I tested the parquet and pickle methods... Storing/reading via pickle decreased my load time from 30s to ~2s. It also made the file much smaller. Thanks! – clambot Apr 16 '22 at 20:26

1 Answer


I found a solution that worked, although it was not directly related to my original question.

Rather than loading a large file into memory and sharing it between independent processes, I found that the real bottleneck was the parsing step in the pandas library, specifically CSV parsing, since CSV is a notoriously inefficient storage format.

I started storing my files in the Python-native pickle format, which pandas supports through the to_pickle() and read_pickle() functions. This cut my load times drastically, from ~30s to ~2s.
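As a rough sketch of how this caching pattern can look in practice (the file paths and the load_dataset() helper are illustrative, not from the original post), the CSV is parsed once and every later run deserializes the cached pickle instead:

```python
import os

import pandas as pd

CSV_PATH = "dataset.csv"    # hypothetical source file
CACHE_PATH = "dataset.pkl"  # hypothetical pickled cache

def load_dataset() -> pd.DataFrame:
    """Load the dataset, paying the CSV-parsing cost only on the first run."""
    if os.path.exists(CACHE_PATH):
        # Fast path: deserialize the already-parsed DataFrame.
        return pd.read_pickle(CACHE_PATH)
    # Slow path: parse the CSV once, then cache the parsed DataFrame.
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(CACHE_PATH)
    return df

df = load_dataset()
```

Parquet (DataFrame.to_parquet() / pd.read_parquet(), which requires pyarrow or fastparquet) is a similar option mentioned in the comments; it is a stable, language-independent format, whereas pickle files are Python-specific and should only be loaded from sources you trust.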

clambot