
I have this simple code:

import pandas as pd

# nrows, names, and usecols were all left at their None defaults, so they are omitted here
data = pd.read_csv(file_path + 'PSI_TS_clean.csv')

data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table')

but my data is too big and I run into memory issues.

What is a clean way to do this chunk by chunk?

  • Which bit, the reading or the writing? read_csv accepts a `chunksize` param; not sure if `to_hdf` does or not – EdChum May 15 '15 at 10:15
  • The writing. I think it should be possible to append or something similar – Donbeo May 15 '15 at 10:24
  • There is a `mode='a'` according to the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html#pandas.DataFrame.to_hdf – EdChum May 15 '15 at 10:49
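
Putting the chunksize and mode='a' suggestions together, a minimal sketch of the chunked approach (the chunk size, format='table', and the reuse of file_path from the question are assumptions, not something confirmed in the thread):

import pandas as pd

csv_path = file_path + 'PSI_TS_clean.csv'
h5_path = file_path + 'PSI_TS_clean.h5'

# chunksize makes read_csv yield DataFrames of at most that many rows
# instead of loading the whole file at once; 500000 is an arbitrary choice.
for chunk in pd.read_csv(csv_path, chunksize=500000):
    # mode='a' keeps previously written chunks; append=True adds rows to the
    # existing 'table' node, which requires the appendable table format.
    # Note: appending string columns may need min_itemsize if later chunks
    # contain longer strings than the first one.
    chunk.to_hdf(h5_path, 'table', mode='a', append=True, format='table')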

1 Answer


If the CSV is really big, split the file using a method such as the one detailed here: chunking-data-from-a-large-file-for-multiprocessing
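
The linked code isn't reproduced here, but a plain-Python sketch of that kind of split (the part naming and size are illustrative assumptions) could look like this:

def split_csv(path, lines_per_part=1000000):
    """Split a large CSV into parts, repeating the header in each part."""
    with open(path) as src:
        header = src.readline()  # keep the header line for every part
        out, part = None, 0
        for i, line in enumerate(src):
            if i % lines_per_part == 0:  # start a new, zero-padded part file
                if out:
                    out.close()
                part += 1
                out = open(f'{path}.part{part:04d}', 'w')
                out.write(header)
            out.write(line)
        if out:
            out.close()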

Then iterate through the resulting files, read each one with pd.read_csv, and write it out with the DataFrame.to_hdf method.

For to_hdf, check the parameters here: DataFrame.to_hdf. You need to ensure mode='a' and consider append=True, as in the sketch below.
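
A minimal sketch of that loop, assuming the parts were written with zero-padded names as above (all file names here are illustrative):

import glob
import pandas as pd

# Read each part in turn and append it to a single HDF5 store.
for part in sorted(glob.glob('PSI_TS_clean.csv.part*')):
    df = pd.read_csv(part)
    # mode='a' opens the store without truncating it; append=True adds the
    # rows to the existing 'table' node rather than replacing it.
    df.to_hdf('PSI_TS_clean.h5', 'table', mode='a', append=True, format='table')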

Without knowing more about the dataframe structure, it's difficult to comment further.

Also, for read_csv there is the param low_memory=False.

  • I think there should be a straightforward way to do that with pandas. By the way, I have solved it by using a computer with more RAM – Donbeo May 16 '15 at 12:17
  • Glad you have solved it. My main data processing machine is 64 GB, so I generally don't run into issues. – ctrl-alt-delete May 18 '15 at 06:30
  • If you add the parameters complib='blosc' and complevel=9 to the to_hdf call, you should see dramatically reduced memory use and a significant speedup. – seumas Nov 24 '15 at 12:12
  • Those parameters are related to the [PyTables](http://www.pytables.org/usersguide/optimization.html) library, which enables the HDF functionality in Pandas. – seumas Nov 24 '15 at 15:51
  • If your csv file is numeric, in the past I have successfully used [Joe Kington's iter_loadtxt approach in Numpy](http://stackoverflow.com/a/8964779/1135883) to achieve better memory usage, although this was against a much, much earlier version of Pandas (0.8.1). – seumas Nov 24 '15 at 15:57
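
A one-line illustration of those compression parameters, mirroring the question's code (file_path and data as defined there):

import pandas as pd

data = pd.read_csv(file_path + 'PSI_TS_clean.csv')  # as in the question
# complib/complevel enable blosc compression for the store; 9 is the
# maximum compression level.
data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table', complib='blosc', complevel=9)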