
When I create a pandas DataFrame with a column that contains a string and save it as HDF5, the file size seems extremely large. The following code produces a file with a size of 1’063’224 Bytes.

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')

When I replace the 'bar' with a 1, the file size shrinks to 7’032 Bytes, which seems (more) reasonable.

Does anyone know where that megabyte of data comes from?

reyymo

1 Answer


The problem is that the column is of dtype object: the strings have variable length, which is not permitted in HDF5.

The column can be converted to string with a length parameter:

data_frame['foo'] = data_frame['foo'].astype('|S80')  # maximum length set to 80 bytes

With this conversion, the file is even smaller than the integer example, at 7024 Bytes.
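For reference, a complete version of the conversion might look like the sketch below; the output path and the 80-byte cap are just example values, not requirements.

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})

# Cast the object column to a fixed-length byte string (80 bytes is an arbitrary cap)
# so pandas can store it compactly instead of falling back to object storage.
data_frame['foo'] = data_frame['foo'].astype('|S80')

# Same call as in the question; only the dtype of the column has changed.
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')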

Radarrudi
  • Compare the file schema between the 2 files and you will see why the table format is more compact. The default schema creates multiple datasets for each dataframe, so requires a lot more storage space. OTOH, the table format is simpler and more efficient. – kcw78 Jul 24 '22 at 21:46
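A minimal sketch of the table-format alternative the comment mentions, assuming the same toy DataFrame; the output filename, min_itemsize, and the 80-byte width are illustrative choices, not anything from the original post:

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})

# Write with format='table' (a PyTables Table) instead of the default 'fixed' format.
# min_itemsize reserves a fixed width for the string column; 80 bytes is an example.
data_frame.to_hdf(
    Path.home() / 'Desktop' / 'file_table.hdf5',
    key='baz',
    mode='w',
    format='table',
    min_itemsize={'foo': 80},
)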