
When I create a pandas DataFrame with a column that contains a string and save it as HDF5, the file size seems extremely large. The following code produces a file with a size of 1’063’224 Bytes.

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')

When I replace the 'bar' with a 1, the file size shrinks to 7’032 Bytes, which seems (more) reasonable.

Does anyone know where that megabyte of data comes from?

reyymo

1 Answer


The problem is that the column is of dtype object: the strings have variable length, which is not permitted in HDF5.

The column can be converted to string with a length parameter:

data_frame['foo'] = data_frame['foo'].astype('|S80')  # maximum length set to 80 bytes

With this conversion, the file is even smaller than the integer example, at 7024 Bytes.
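For reference, a complete version of the conversion might look like the sketch below; the output path and the 80-byte cap are just example values, not requirements.

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})

# Cast the object column to a fixed-length byte string (80 bytes is an arbitrary cap)
# so pandas can store it compactly instead of falling back to object storage.
data_frame['foo'] = data_frame['foo'].astype('|S80')

# Same call as in the question; only the dtype of the column has changed.
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')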

Radarrudi
  • Compare the file schema between the 2 files and you will see why the table format is more compact. The default schema creates multiple datasets for each dataframe, so requires a lot more storage space. OTOH, the table format is simpler and more efficient. – kcw78 Jul 24 '22 at 21:46
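A minimal sketch of the table-format alternative the comment mentions, assuming the same toy DataFrame; the output filename, min_itemsize, and the 80-byte width are illustrative choices, not anything from the original post:

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})

# Write with format='table' (a PyTables Table) instead of the default 'fixed' format.
# min_itemsize reserves a fixed width for the string column; 80 bytes is an example.
data_frame.to_hdf(
    Path.home() / 'Desktop' / 'file_table.hdf5',
    key='baz',
    mode='w',
    format='table',
    min_itemsize={'foo': 80},
)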