
I'm using pandas HDFStore to store data in HDF5 files.

Typically, data is appended one sample at a time rather than in large batches.

I noticed the files are growing pretty fast and I can reduce them substantially with ptrepack.

Here's an example with a small file. The file generated by my application (using zlib and complevel 9) is 6.7 MB.

/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(1)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (2048,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

If I ptrepack it with no options, it gets dramatically smaller (71 KB):

/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (2048,)

When using --complevel=1 or --complevel=9, I get a 19 KB file:

/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(1)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (2048,)


/ (RootGroup) ''
/test (Group) ''
/test/table (Table(1042,), shuffle, zlib(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (2048,)

These are small files, but my point is that I could shrink the whole 35 GB database down to a few hundred MB just by repacking it.

There has to be something wrong with the way the data is written.

I know about the "HDF5 does not reclaim space" warning, but my normal use case involves little or no deletion of data.

To append new data, I use

store.append(data_id, data_dataframe)

so I only append; I never delete or rewrite the whole dataset.
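
For reference, here is a stripped-down sketch of the write pattern (file name, key and column names are invented; the store settings match what my application uses):

    import numpy as np
    import pandas as pd

    # Open the store the way my application does: zlib compression, complevel 9.
    store = pd.HDFStore("test.h5", mode="w", complib="zlib", complevel=9)

    # Samples arrive one at a time and are appended individually.
    for i in range(1042):
        sample = pd.DataFrame(
            {"a": [np.random.rand()], "b": [np.random.rand()], "c": [i]},
            index=[i],
        )
        store.append("test", sample)

    store.close()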

I noticed a difference between the dumps above:

  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

but I don't know what to conclude about it.
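
If that column index matters, one experiment I could run (just a guess; file name and key are invented) is to skip index maintenance during the appends and build the index once at the end:

    import pandas as pd

    with pd.HDFStore("test_noindex.h5", mode="w", complib="zlib", complevel=9) as store:
        for i in range(1042):
            sample = pd.DataFrame({"a": [0.0], "b": [0.0], "c": [i]}, index=[i])
            # index=False: don't update the column index on every append
            store.append("test", sample, index=False)
        # build the table index once, after all the appends
        store.create_table_index("test")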

I suspect the size issue comes from the samples being added one at a time, but I don't see exactly why that should be a problem: the whole chunk should be compressed even when only a small amount of data is added.

Or is it because each time a chunk is modified, it is written to a new location in the file and the space of the old chunk is lost?
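
One way I could check that suspicion is to log the file size after every append and see whether it grows by roughly a compressed chunk per sample (rough sketch, names invented):

    import os
    import numpy as np
    import pandas as pd

    path = "growth_test.h5"
    with pd.HDFStore(path, mode="w", complib="zlib", complevel=9) as store:
        for i in range(200):
            sample = pd.DataFrame(
                {"a": [np.random.rand()], "b": [np.random.rand()], "c": [i]},
                index=[i],
            )
            store.append("test", sample)
            store.flush()
            print(i, os.path.getsize(path))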

In this case, I guess my options are:

  • Modify the application so that data is written in batches, maybe by adding a caching layer. That is practically impossible; I might as well change the underlying database.

  • Pick a much lower chunk size. But this has downsides as well.

  • Set a script to ptrepack the data regularly (rough sketch below).
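
For the last option, I have something like this in mind (file names are placeholders; ptrepack is the command-line tool shipped with PyTables):

    import os
    import subprocess

    SRC = "database.h5"            # placeholder
    TMP = "database_repacked.h5"   # placeholder

    # Repack into a fresh file with the same compression settings,
    # then replace the original file.
    subprocess.run(
        ["ptrepack", "--chunkshape=auto", "--propindexes",
         "--complib=zlib", "--complevel=9", SRC, TMP],
        check=True,
    )
    os.replace(TMP, SRC)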

Jérôme
  • Related: [pandas pytables append: performance and increase in file size](https://stackoverflow.com/questions/22934996/pandas-pytables-append-performance-and-increase-in-file-size) – jpp Jul 11 '18 at 12:24
  • Thanks @jpp, I saw this already. I don't mind the speed. I don't think the `min_item_size` recommendation applies as I'm not merging different files but appending to the same file. – Jérôme Jul 11 '18 at 15:16
