I'm using pandas HDFStore to store data in HDF5 files.
Typically, data is appended one sample at a time rather than in large batches.
I noticed the files grow pretty fast, and that I can shrink them substantially with ptrepack.
Here's an example with a small file. The file generated by my application (using zlib at complevel 9) is 6.7 MB:
/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(1)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
byteorder := 'little'
chunkshape := (2048,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
If I ptrepack it with no options, it gets dramatically smaller (71K):
/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
byteorder := 'little'
chunkshape := (2048,)
When using --complevel=1 or --complevel=9, I get a 19K file in both cases:
/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(1)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
byteorder := 'little'
chunkshape := (2048,)
/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
byteorder := 'little'
chunkshape := (2048,)
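For reference, the repack commands are essentially the following (input and output file names are placeholders):

ptrepack data.h5 repacked.h5
ptrepack --complevel=1 --complib=zlib data.h5 repacked_1.h5
ptrepack --complevel=9 --complib=zlib data.h5 repacked_9.h5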
These are small files, but my point is that I could shrink the whole 35 GB database down to a few hundred MB just by repacking it.
There has to be something wrong with the way it is written.
I know about the "HDF5 does not reclaim space" warning. The normal use case involves little or no deletion of data.
To append new data, I use
store.append(data_id, data_dataframe)
so I only append; I never delete or rewrite existing data.
I noticed one difference between the dumps above: the original file has these extra lines,
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
but I don't know what to conclude from it.
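If maintaining that column index on every append turns out to be part of the problem, one experiment would be to disable it during the appends and build it once at the end (just a sketch; sample_dataframe is a placeholder for a single sample):

import pandas as pd

store = pd.HDFStore("example.h5", complib="zlib", complevel=9)

# Skip updating the PyTables column index on each append...
store.append("test", sample_dataframe, index=False)

# ...and build it once after all appends are done.
store.create_table_index("test")
store.close()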
I suspect the size issue comes from the fact that samples are added one at a time. But I don't see exactly why this should be a problem: the whole chunk should be recompressed even when only a small amount of data is added.
Or is it because each time a chunk is modified, it is written to a new location and the space of the old chunk is lost?
In this case, I guess my options are:
Modify the application so that data is written in batches, perhaps by adding a caching layer (roughly like the sketch after this list). Practically impossible; I might as well change the underlying database.
Pick a much smaller chunk size. But this has downsides as well.
Set up a script to ptrepack the data regularly.
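For the first option, the caching layer would look roughly like this (a sketch only; the class name, batch size, and buffering scheme are made up):

import pandas as pd

class BufferedStore:
    # Accumulate single-sample DataFrames in memory and flush them to the
    # HDFStore in batches, so each key is appended to far less often.
    def __init__(self, path, batch_size=1000):
        self.store = pd.HDFStore(path, complib="zlib", complevel=9)
        self.batch_size = batch_size
        self.buffers = {}  # data_id -> list of single-row DataFrames

    def append(self, data_id, sample):
        buf = self.buffers.setdefault(data_id, [])
        buf.append(sample)
        if len(buf) >= self.batch_size:
            self.flush(data_id)

    def flush(self, data_id):
        buf = self.buffers.get(data_id)
        if buf:
            self.store.append(data_id, pd.concat(buf))
            self.buffers[data_id] = []

    def close(self):
        for data_id in list(self.buffers):
            self.flush(data_id)
        self.store.close()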