
I have a dataset that is too large to read into memory, and I don't want to upgrade the machine. From my reading, HDF5 may be a suitable solution to my problem. But I am not sure how to write the data to the HDF5 file iteratively, since I cannot load the CSV file as a DataFrame object.

So my question is: how do I write a large CSV file into an HDF5 file with Python pandas?

Yan Song

1 Answer


You can read the CSV file in chunks using the chunksize parameter and append each chunk to the HDF5 file:

import pandas as pd

hdf_key = 'hdf_key'
df_cols_to_index = [...]  # list of columns (labels) that should be indexed
store = pd.HDFStore(hdf_filename)

for chunk in pd.read_csv(csv_filename, chunksize=500000):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=df_cols_to_index, index=False)

# index data columns in HDFStore (once, after all chunks are written)
store.create_table_index(hdf_key, columns=df_cols_to_index, optlevel=9, kind='full')
store.close()
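To see what chunksize gives you on its own, here is a minimal self-contained sketch. The inline CSV and column names are illustrative stand-ins for the large file on disk; each chunk that read_csv yields is an ordinary DataFrame, so you can aggregate (or, as above, append to an HDFStore) without ever holding the whole file in memory:

import io
import pandas as pd

# Hypothetical small CSV standing in for the large file on disk
csv_data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# Process the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    # chunk is a regular DataFrame with up to 4 rows
    total += chunk["b"].sum()

print(total)  # same result as pd.read_csv(...)["b"].sum() on the full file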
MaxU - stand with Ukraine