
I need to merge 5 collections in MongoDB on a couple of field names and return the result as a CSV file. I can read the collections into pandas using the from_records method without a problem and merge a subset of them using pd.merge, but each data frame I want to merge has 20,000+ columns and 100,000+ rows, so the merging process is extremely slow due to the sheer size.

I've never dealt with data on this order of magnitude -- what are some ways I can speed up this process? Maybe pandas isn't the right tool to use at this point?
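
Roughly what I am doing now (a simplified sketch -- the database, collection, and field names are placeholders):

    from functools import reduce

    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient()   # local MongoDB instance
    db = client["mydb"]      # placeholder database name

    # Read each collection into a data frame (dropping the _id field).
    frames = [
        pd.DataFrame.from_records(db[name].find({}, {"_id": 0}))
        for name in ["coll_a", "coll_b", "coll_c", "coll_d", "coll_e"]
    ]

    # Merge them pairwise on the shared field names -- this is the slow part.
    merged = reduce(lambda left, right: pd.merge(left, right, on=["field1", "field2"]), frames)
    merged.to_csv("merged.csv", index=False)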

e1v1s
  • Do you have a memory problem? – madjaoue Jul 17 '18 at 14:21
  • If you are looking for scalable solutions you should probably take a look at dask https://dask.pydata.org/en/latest/ (see the sketch after these comments). Another option could be to change the approach you are taking and use a different format such as HDF5 – horro Jul 17 '18 at 14:22
  • Yes, is there a way to avoid reading the data into memory when doing the merges? – e1v1s Jul 17 '18 at 14:22
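
A minimal sketch of the dask suggestion, assuming the pandas frames already exist (left_df, right_df, and the key names are placeholders; npartitions is just a tuning knob):

    import dask.dataframe as dd

    # Wrap the existing pandas frames in partitioned dask frames.
    left_dd = dd.from_pandas(left_df, npartitions=16)
    right_dd = dd.from_pandas(right_df, npartitions=16)

    # Same API as pd.merge, but evaluated lazily, partition by partition.
    merged = dd.merge(left_dd, right_dd, on=["field1", "field2"])

    # Writes one CSV per partition (merged-0.csv, merged-1.csv, ...).
    merged.to_csv("merged-*.csv", index=False)

For a fully out-of-core workflow you would normally land the data in files first and read it with dd.read_csv or dd.read_parquet instead of going through from_pandas.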

1 Answer


I guess you need to distribute your processing.

One way to do this is to split your input into multiple chunks, use multiprocessing to generate intermediate outputs, and then combine them all at the end.

How do I do this in pandas?

"Large data" work flows using pandas

madjaoue