
I have this simple code:

import pandas as pd

# nrows, names, and usecols were all left at their None defaults, so they are omitted here
data = pd.read_csv(file_path + 'PSI_TS_clean.csv')

data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table')

but my data is too big and I run into memory issues.

What is a clean way to do this chunk by chunk?

  • Which bit, the reading or the writing? read_csv accepts a `chunksize` param; not sure if `to_hdf` does or not – EdChum May 15 '15 at 10:15
  • The writing. I think it should be possible to append or something similar – Donbeo May 15 '15 at 10:24
  • There is a `mode='a'` according to the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_hdf.html#pandas.DataFrame.to_hdf – EdChum May 15 '15 at 10:49
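
Putting the chunksize and mode='a' suggestions together, a minimal sketch of the chunked approach (the chunk size, format='table', and the reuse of file_path from the question are assumptions, not something confirmed in the thread):

import pandas as pd

csv_path = file_path + 'PSI_TS_clean.csv'
h5_path = file_path + 'PSI_TS_clean.h5'

# chunksize makes read_csv yield DataFrames of at most that many rows
# instead of loading the whole file at once; 500000 is an arbitrary choice.
for chunk in pd.read_csv(csv_path, chunksize=500000):
    # mode='a' keeps previously written chunks; append=True adds rows to the
    # existing 'table' node, which requires the appendable table format.
    # Note: appending string columns may need min_itemsize if later chunks
    # contain longer strings than the first one.
    chunk.to_hdf(h5_path, 'table', mode='a', append=True, format='table')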

1 Answer


If the CSV is really big, split the file using a method such as the one detailed here: chunking-data-from-a-large-file-for-multiprocessing
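
The linked code isn't reproduced here, but a plain-Python sketch of that kind of split (the part naming and size are illustrative assumptions) could look like this:

def split_csv(path, lines_per_part=1000000):
    """Split a large CSV into parts, repeating the header in each part."""
    with open(path) as src:
        header = src.readline()  # keep the header line for every part
        out, part = None, 0
        for i, line in enumerate(src):
            if i % lines_per_part == 0:  # start a new, zero-padded part file
                if out:
                    out.close()
                part += 1
                out = open(f'{path}.part{part:04d}', 'w')
                out.write(header)
            out.write(line)
        if out:
            out.close()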

Then iterate through the resulting files, read each one with pd.read_csv, and write it out with the DataFrame.to_hdf method.

For to_hdf, check the parameters here: DataFrame.to_hdf. You need to ensure mode='a' and consider append=True, as in the sketch below.
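
A minimal sketch of that loop, assuming the parts were written with zero-padded names as above (all file names here are illustrative):

import glob
import pandas as pd

# Read each part in turn and append it to a single HDF5 store.
for part in sorted(glob.glob('PSI_TS_clean.csv.part*')):
    df = pd.read_csv(part)
    # mode='a' opens the store without truncating it; append=True adds the
    # rows to the existing 'table' node rather than replacing it.
    df.to_hdf('PSI_TS_clean.h5', 'table', mode='a', append=True, format='table')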

Without knowing more about the dataframe structure, it's difficult to comment further.

Also, for read_csv there is the param low_memory=False.

  • I think there should be a straightforward way to do that with pandas. By the way, I have solved it by using a computer with more RAM – Donbeo May 16 '15 at 12:17
  • Glad you have solved it. My main data processing machine is 64 GB, so I generally don't run into issues. – ctrl-alt-delete May 18 '15 at 06:30
  • If you add the parameters complib='blosc' and complevel=9 to the to_hdf call, you should see dramatically reduced memory use and a significant speedup. – seumas Nov 24 '15 at 12:12
  • Those parameters are related to the [PyTables](http://www.pytables.org/usersguide/optimization.html) library, which enables the HDF functionality in Pandas. – seumas Nov 24 '15 at 15:51
  • If your csv file is numeric, in the past I have successfully used [Joe Kington's iter_loadtxt approach in Numpy](http://stackoverflow.com/a/8964779/1135883) to achieve better memory usage, although this was against a much, much earlier version of Pandas (0.8.1). – seumas Nov 24 '15 at 15:57
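
A one-line illustration of those compression parameters, mirroring the question's code (file_path and data as defined there):

import pandas as pd

data = pd.read_csv(file_path + 'PSI_TS_clean.csv')  # as in the question
# complib/complevel enable blosc compression for the store; 9 is the
# maximum compression level.
data.to_hdf(file_path + 'PSI_TS_clean.h5', 'table', complib='blosc', complevel=9)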