
So I'm working with parametric energy simulations and ended up with 500 GB+ of data stored in .csv files. I need to be able to process all this data to compare the results and gain insight into the influence of the different parameters.

Each CSV file name contains information about the parameters used for that simulation, so I cannot simply merge the files.

I normally load the .csv files into Python using pandas and define a class, but now (with all this data) there is not enough memory to do this.

Can you point me to a way to process this data? I need to be able to make plots and compare the CSV files.

Thank you for your time.

  • In short, you need lazy evaluation of data. You may want to research a means of retrieving your data points one by one (or in batches), but naturally this depends on your particular problem (which you haven't explained to us). – E_net4 Oct 01 '16 at 19:27
  • Maybe helpful: [“Large data” work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas/14268804#14268804) – Brad Solomon Sep 20 '17 at 02:22
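The chunked retrieval suggested in the first comment can be done directly with pandas. A minimal sketch, assuming a hypothetical file name and a hypothetical `result` column, that reads one CSV in fixed-size chunks so only a slice of it is ever held in memory:

```python
import pandas as pd

# Minimal sketch of batch retrieval: read one large CSV in chunks so that
# only `chunksize` rows are in memory at a time.
# "simulation_run_001.csv" and the "result" column are placeholder names.
partial_sums = []
for chunk in pd.read_csv("simulation_run_001.csv", chunksize=100_000):
    # Aggregate each chunk instead of keeping the raw rows around.
    partial_sums.append(chunk["result"].sum())

total = sum(partial_sums)
print(total)
```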

1 Answer


Convert the CSV files to HDF5, a format created to deal with massive and complex datasets. It works with pandas as well as with other libraries.
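A minimal sketch of that conversion, assuming the files live under `results/` and that each file name encodes two parameters as `<param1>_<param2>.csv` (adapt the parsing to your actual naming scheme). It appends each file chunk by chunk, so the conversion itself stays within memory, and keeps the parameters as queryable columns:

```python
import glob
import os
import pandas as pd

# Build one HDF5 store from many per-simulation CSV files, keeping the
# parameters encoded in each file name as extra columns so individual
# simulations can still be compared after they share a store.
with pd.HDFStore("simulations.h5", mode="w") as store:
    for path in glob.glob("results/*.csv"):
        # Hypothetical naming scheme: <param1>_<param2>.csv
        name = os.path.splitext(os.path.basename(path))[0]
        param1, param2 = name.split("_")

        for chunk in pd.read_csv(path, chunksize=100_000):
            chunk["param1"] = param1
            chunk["param2"] = param2
            # "table" format supports appending and on-disk queries.
            store.append("results", chunk, format="table",
                         data_columns=["param1", "param2"])
```

Afterwards a subset can be pulled without loading the whole store, e.g. `pd.read_hdf("simulations.h5", "results", where='param1 == "A"')` returns only the matching rows (this requires the PyTables package to be installed).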

mr nick
  • I've converted the files and now have separate 120 GB HDF5 files, but it takes forever to query. For example, store.keys()[0] takes around 3 minutes. Any idea why? – Paulo Castro Da Silva Oct 05 '16 at 13:41