
I want to create one large pd.DataFrame out of 7 .txt files of 4 GB each, which I want to work with and then save to .csv.

What I did:

I created a for loop that opened and concatenated the files one by one on axis=0, so that my index (a timestamp) continues across files.
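For reference, a minimal sketch of what that loop might look like (file names, separator, and the index column are assumptions, since the question doesn't show them):

    import pandas as pd

    # Placeholder file names; the real ones are seven ~4 GB .txt files.
    files = [f"part_{i}.txt" for i in range(7)]

    frames = []
    for path in files:
        # Each file is fully materialised in memory here.
        frames.append(pd.read_csv(path, sep="\t", index_col=0, parse_dates=True))

    # Concatenating on axis=0 continues the timestamp index, but it also needs
    # all the pieces plus the combined result in memory at the same time.
    df = pd.concat(frames, axis=0)
    df.to_csv("combined.csv")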

However, I am running into memory problems, even though I am working on a server with 100 GB of RAM. I read somewhere that pandas can take up 5-10x the on-disk data size in memory.

What are my alternatives?

One is to create an empty .csv, then open it together with each .txt file, append a new chunk, and save (see the sketch below).
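A minimal sketch of that chunk-and-append idea, assuming the .txt files can be parsed by pandas (separator and file names are placeholders):

    import pandas as pd

    files = [f"part_{i}.txt" for i in range(7)]  # placeholder names
    out_path = "combined.csv"

    first_chunk = True
    for path in files:
        # Read each file in manageable pieces instead of all at once.
        for chunk in pd.read_csv(path, sep="\t", index_col=0, parse_dates=True,
                                 chunksize=1_000_000):
            # Append each piece to the output .csv; only the first piece writes a header.
            chunk.to_csv(out_path, mode="a", header=first_chunk)
            first_chunk = False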

Other ideas?

Mario L
  • Check dask for chunked dataframes. Also, you might want to reconsider csv and use a compressed binary format to store the data, you could save some space and save time when reading it back. https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/ – Ignacio Vergara Kausel Oct 09 '17 at 07:20
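A rough sketch of the dask + Parquet route mentioned in the comment (the file pattern, separator, and timestamp column name are assumptions; it requires dask and a Parquet engine such as pyarrow):

    import dask.dataframe as dd

    # Lazily reads all matching files as one partitioned dataframe,
    # without loading everything into memory at once.
    ddf = dd.read_csv("part_*.txt", sep="\t", parse_dates=["timestamp"])

    # Writes a compressed, columnar copy that is much smaller and faster
    # to read back than a plain .csv.
    ddf.to_parquet("combined_parquet/", compression="snappy")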

1 Answer


Creating an HDF5 file with the h5py library will allow you to build one big dataset and access it without loading all the data into memory.

This answer provides an example of how to create and incrementally grow an HDF5 dataset: incremental writes to hdf5 with h5py
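For completeness, a small sketch of that incremental pattern, assuming the data has already been parsed into a fixed number of numeric columns and arrives as NumPy chunks (n_cols and the chunks here are placeholders):

    import h5py
    import numpy as np

    n_cols = 10  # placeholder: number of numeric columns in the data

    with h5py.File("combined.h5", "w") as f:
        # A resizable dataset: maxshape=(None, n_cols) lets the first axis grow.
        dset = f.create_dataset("data", shape=(0, n_cols),
                                maxshape=(None, n_cols), dtype="float64",
                                chunks=True)
        for _ in range(7):  # one iteration per source file, for illustration
            chunk = np.random.rand(1_000, n_cols)  # stand-in for a parsed chunk
            # Grow the dataset along axis 0, then write the new rows at the end.
            dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
            dset[-chunk.shape[0]:] = chunk

The data can then be read back in slices (e.g. f["data"][i:j]) without ever loading the whole file into memory.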