
I'm writing a program that runs algorithms on a very large (~6GB) CSV file, which is loaded with pandas using read_csv(). The issue is that every time I tweak my algorithms and need to re-simulate (which is very often), I have to wait ~30s for the dataset to load into memory, and then another ~30s to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded, however, operations run very quickly.

So far I've tried using mmap and loading the dataset onto a RAM disk, with no improvement.

I'm hoping to find a way to load the dataset into memory once with one process, and then access it in memory from the algorithm-crunching process, which gets re-run each time I make a change.

This thread seems close-ish to what I need, but it uses multiprocessing, which requires everything to run within the same context.

I'm not a computer engineer (I'm electrical :), so I'm not sure whether what I'm asking for is even possible. Any help would be appreciated, however.

Thanks,

clambot
  • What is the speed of your hard disk? At 200 MB/s, it'll already take 30 seconds to load the file. How much RAM do you have? Why not let the algorithm and visualization run in the same process? – Thomas Weller Apr 16 '22 at 17:05
  • Are you updating then re-running the script each time? Aren't you using something interactive like IPython or Jupyter? – Thierry Lathuille Apr 16 '22 at 17:13
  • @ThomasWeller The hard disk is a WD Black SN750SE 1TB; it seems to do 3600MB/s. I have 64GB of DDR4 RAM. The reason for the separate graphing is that I can load it with different parameters, and I often need to load several graphs at a time. I'm thinking the pandas parsing may be my limiting factor here. – clambot Apr 16 '22 at 17:15
  • @ThierryLathuille Yes, I am updating and then re-running. The program is >10,000 lines long. Is this something I can even run in Jupyter? – clambot Apr 16 '22 at 17:19
  • IMHO, CSVs are slow to parse and very inefficient in terms of space. If you are using the same dataset over and over, I would try to re-cast the data as something more efficient (nearer to binary) and work off that... – Mark Setchell Apr 16 '22 at 17:46
  • For example `123456.789` plus its comma in a CSV is 12 bytes, but as an IEEE754 float it is just 4 bytes. – Mark Setchell Apr 16 '22 at 19:22
  • WOW. @MarkSetchell, I looked into this a bit and stumbled on this helpful article: https://towardsdatascience.com/stop-persisting-pandas-data-frames-in-csvs-f369a6440af5 I tested the parquet and pickle methods... Storing/reading via pickle decreased my load time from 30s to ~2s. It also made the file much smaller. Thanks! – clambot Apr 16 '22 at 20:26

1 Answer


I found a solution that worked, although it was not directly related to my original question.

Rather than loading a large file into memory and sharing it between independent processes, I found that the real bottleneck was the parsing step in the pandas library, specifically CSV parsing, since CSV is a notoriously inefficient storage format.

I started storing my files in the Python-native pickle format, which pandas supports through the to_pickle() and read_pickle() functions. This cut my load times drastically, from ~30s to ~2s.
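As a rough sketch of how this caching pattern can look in practice (the file paths and the load_dataset() helper are illustrative, not from the original post), the CSV is parsed once and every later run deserializes the cached pickle instead:

```python
import os

import pandas as pd

CSV_PATH = "dataset.csv"    # hypothetical source file
CACHE_PATH = "dataset.pkl"  # hypothetical pickled cache

def load_dataset() -> pd.DataFrame:
    """Load the dataset, paying the CSV-parsing cost only on the first run."""
    if os.path.exists(CACHE_PATH):
        # Fast path: deserialize the already-parsed DataFrame.
        return pd.read_pickle(CACHE_PATH)
    # Slow path: parse the CSV once, then cache the parsed DataFrame.
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(CACHE_PATH)
    return df

df = load_dataset()
```

Parquet (DataFrame.to_parquet() / pd.read_parquet(), which requires pyarrow or fastparquet) is a similar option mentioned in the comments; it is a stable, language-independent format, whereas pickle files are Python-specific and should only be loaded from sources you trust.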

clambot