
So I'm working with parametric energy simulations and ended up with 500 GB+ of data stored in .csv files. I need to be able to process all this data to compare the results and gain insight into the influence of the different parameters.

Each CSV file name contains information about the parameters used for that simulation, so I cannot simply merge the files.

I normally load the .csv files into Python using pandas and define a class, but now (with all this data) there is not enough memory to do this.

Can you point me to a way to process this data? I need to be able to make plots and compare the CSV files.

Thank you for your time.

  • In short, you need lazy evaluation of data. You may want to research a means of retrieving your data points one by one (or in batches), but naturally this depends on your particular problem (which you haven't explained to us). – E_net4 Oct 01 '16 at 19:27
  • Maybe helpful: [“Large data” work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas/14268804#14268804) – Brad Solomon Sep 20 '17 at 02:22
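The chunked retrieval suggested in the first comment can be done directly with pandas. A minimal sketch, assuming a hypothetical file name and a hypothetical `result` column, that reads one CSV in fixed-size chunks so only a slice of it is ever held in memory:

```python
import pandas as pd

# Minimal sketch of batch retrieval: read one large CSV in chunks so that
# only `chunksize` rows are in memory at a time.
# "simulation_run_001.csv" and the "result" column are placeholder names.
partial_sums = []
for chunk in pd.read_csv("simulation_run_001.csv", chunksize=100_000):
    # Aggregate each chunk instead of keeping the raw rows around.
    partial_sums.append(chunk["result"].sum())

total = sum(partial_sums)
print(total)
```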

1 Answer


Convert the CSV files to HDF5, a format created to deal with massive and complex datasets. It works with pandas as well as with other libraries.
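A minimal sketch of that conversion, assuming the files live under `results/` and that each file name encodes two parameters as `<param1>_<param2>.csv` (adapt the parsing to your actual naming scheme). It appends each file chunk by chunk, so the conversion itself stays within memory, and keeps the parameters as queryable columns:

```python
import glob
import os
import pandas as pd

# Build one HDF5 store from many per-simulation CSV files, keeping the
# parameters encoded in each file name as extra columns so individual
# simulations can still be compared after they share a store.
with pd.HDFStore("simulations.h5", mode="w") as store:
    for path in glob.glob("results/*.csv"):
        # Hypothetical naming scheme: <param1>_<param2>.csv
        name = os.path.splitext(os.path.basename(path))[0]
        param1, param2 = name.split("_")

        for chunk in pd.read_csv(path, chunksize=100_000):
            chunk["param1"] = param1
            chunk["param2"] = param2
            # "table" format supports appending and on-disk queries.
            store.append("results", chunk, format="table",
                         data_columns=["param1", "param2"])
```

Afterwards a subset can be pulled without loading the whole store, e.g. `pd.read_hdf("simulations.h5", "results", where='param1 == "A"')` returns only the matching rows (this requires the PyTables package to be installed).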

mr nick
  • I've converted the files and now have separate 120 GB HDF5 files, but it takes forever to query. For example, store.keys()[0] takes around 3 minutes. Any idea why? – Paulo Castro Da Silva Oct 05 '16 at 13:41