
I have read several times that turning on compression in HDF5 can lead to better read/write performance.

I wonder what the ideal settings are to achieve good read/write performance with:

 data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)

I'm already using fixed format (i.e. h5py) as it's faster than table. I have strong processors and do not care much about disk space.

I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.
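For reference, here is the kind of timing comparison I have in mind; the file name and the generated frame below are placeholders rather than my real data:

    import time
    import numpy as np
    import pandas as pd

    # Placeholder frame roughly the size of my real data (float64 only here).
    data_df = pd.DataFrame(np.random.rand(2500, 9000))

    def time_roundtrip(complib, complevel, path='test.h5'):
        t0 = time.perf_counter()
        data_df.to_hdf(path, key='df', format='fixed',
                       complib=complib, complevel=complevel)
        t1 = time.perf_counter()
        pd.read_hdf(path, key='df')
        t2 = time.perf_counter()
        print(f'{complib}/{complevel}: write {t1 - t0:.2f}s, read {t2 - t1:.2f}s')

    time_roundtrip(None, 0)      # current setup: no compression
    time_roundtrip('blosc', 5)
    time_roundtrip('zlib', 5)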

Mark Horvath
  • The compression level is basically a tradeoff between processing speed and disk space used. If you have fast processors and don't care about disk space, then it shouldn't really matter; just use the default. Of course, this is one of those YMMV things where there is no substitute for trying a couple of different compression levels and seeing what is best on your particular data. Also check read vs. write performance at each level, as it will not be symmetric. – JohnE Jul 13 '15 at 12:51
  • Default is no compression, and I'm pretty sure I can improve on that ;-) I'll have to try myself, but would appreciate good intuition... some compression algorithms are good for speed, others for compression ratio. I'm also not sure what chunksize actually influences, or whether compression works on `str` at all, as it is stored as `Object` I believe. I'll also have to run this on several different machines. – Mark Horvath Jul 13 '15 at 13:02
  • My purpose is to improve execution time, and I'm pretty sure I can improve that too by applying compression (e.g. AHL uses lz4 to speed up storing data). – Mark Horvath Jul 13 '15 at 15:30
  • Right, I think as far as straight execution time is concerned there isn't much substitute for trying different types and levels of compression, though maybe someone else will have some general pointers. As far as strings are concerned, you might also want to look at storing them as categorical values (see the short sketch after these comments). That's roughly equivalent to string compression, but it will also benefit you while the dataframe is loaded into memory, not just while it's stored. – JohnE Jul 13 '15 at 15:35
  • Have found two similar threads ([hdf5 concurrency](http://stackoverflow.com/questions/16628329/hdf5-concurrency-compression-i-o-performance?rq=1) and [pytables write performance](http://stackoverflow.com/questions/20083098/improve-pandas-pytables-hdf5-table-write-performance?rq=1)). Using `blosc` compression seems to reach/beat the performance of no compression in the examples. – Mark Horvath Jul 14 '15 at 09:15
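Following up on the categorical suggestion above, a minimal sketch (the column name and values are made up):

    import pandas as pd

    # Hypothetical repetitive string column; 'category' stores each distinct
    # value once plus small integer codes.
    df = pd.DataFrame({'label': ['buy', 'sell', 'hold'] * 3000})
    print(df.memory_usage(deep=True))

    df['label'] = df['label'].astype('category')
    print(df.memory_usage(deep=True))  # much smaller for repetitive strings

    # Note: HDFStore serialises categoricals with format='table', not 'fixed'.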

1 Answer


There are a couple of compression filters that you could use, and since HDF5 version 1.8.11 you can easily register third-party compression filters.

Regarding performance:

It probably depends on your access pattern: you want to define proper dimensions for your chunks so that they align well with your access pattern, otherwise your performance will suffer a lot. For example, if you know that you usually access one column and all rows, you should define your chunk shape accordingly, i.e. (1, 9000). See here, here and here for some information.
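A minimal sketch of setting an explicit chunk shape, assuming you drop down to h5py (pandas' `to_hdf` does not expose the chunk shape); the file name is made up and gzip stands in for whatever filter you end up choosing:

    import h5py
    import numpy as np

    data = np.random.rand(2500, 9000)

    with h5py.File('chunked.h5', 'w') as f:
        # Chunk shape chosen to match the slices you read most often;
        # gzip is used here only because it ships with every HDF5 build.
        f.create_dataset('data', data=data, chunks=(1, 9000),
                         compression='gzip', compression_opts=4)

    with h5py.File('chunked.h5', 'r') as f:
        row = f['data'][0, :]  # this slice touches exactly one chunk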

However, AFAIK pandas will usually end up loading the entire HDF5 file into memory unless you use `read_table` and an iterator (see here) or do the partial I/O yourself (see here), so it doesn't really benefit much from defining a good chunk size.
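If you do want partial I/O through pandas, a sketch of what that looks like; it requires the `table` format rather than `fixed`, and the frame here is again a placeholder:

    import numpy as np
    import pandas as pd

    data_df = pd.DataFrame(np.random.rand(2500, 9000))
    data_df.to_hdf('store.h5', key='df', format='table')

    # Iterate over the file chunk by chunk instead of loading it all,
    # e.g. to accumulate column sums without holding the whole frame:
    totals = None
    for chunk in pd.read_hdf('store.h5', key='df', chunksize=500):
        s = chunk.sum()
        totals = s if totals is None else totals + s

    # Or read an explicit row range:
    with pd.HDFStore('store.h5') as store:
        part = store.select('df', start=0, stop=100)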

Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.

Regarding your original question:

I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression filters:

  • BloscLZ: internal default compressor, heavily based on FastLZ.
  • LZ4: a compact, very popular and fast compressor.
  • LZ4HC: a tweaked version of LZ4, produces better compression ratios at the expense of speed.
  • Snappy: a popular compressor used in many places.
  • Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.

These have different strengths and the best thing is to try and benchmark them with your data and see which works best.
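A quick way to compare the listed codecs directly, assuming the python-blosc package is installed (which `cname`s are available depends on how your Blosc was built, and random data compresses poorly, so run this on your real arrays):

    import time
    import blosc
    import numpy as np

    arr = np.random.rand(2500, 9000)   # roughly the shape from the question
    raw = arr.tobytes()

    for cname in ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']:
        t0 = time.perf_counter()
        packed = blosc.compress(raw, typesize=8, clevel=5, cname=cname)
        t1 = time.perf_counter()
        blosc.decompress(packed)
        t2 = time.perf_counter()
        print(f'{cname:8s} ratio={len(raw) / len(packed):5.2f} '
              f'compress={t1 - t0:.3f}s decompress={t2 - t1:.3f}s')

Recent pandas/PyTables versions expose the same codecs through `to_hdf` as `complib='blosc:lz4'`, `'blosc:lz4hc'`, `'blosc:zlib'` and so on, so whichever codec wins here can be plugged straight into your original call.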

Ümit
  • Great! The rule of thumb on `chunksize` makes sense; in fact I always read the whole dataset since I'm using `fixed` format (I do chunking at the filesystem level). Now I can see why all the examples I have found use `blosc`, thanks! – Mark Horvath Jul 14 '15 at 09:54
  • I think the reason why pandas reads the entire file into memory is unrelated to whether you use `fixed` or not; it was designed this way. To do statistics in pandas (`sum`, `mean`), pandas needs to read the entire dataset. You could drop down to `PyTables`, which supports queries that won't read the entire dataset into memory but only chunk by chunk (however, you won't have the convenient pandas functions). Alternatively, for datasets that don't fit into memory, [Blaze](http://blaze.pydata.org/en/latest/) might be a good solution. – Ümit Jul 14 '15 at 11:31
  • Wouldn't `chunks=(1, 9000)` mean you access one row and all columns? Since HDF5 is organised in row-major order. – collector Nov 08 '17 at 14:02