
I need to store a 512^3 array on disk in some way, and I'm currently using HDF5. Since the array is sparse, a lot of disk space gets wasted.

Does HDF5 provide any support for sparse arrays?

andreabedini

3 Answers


One workaround is to create the dataset with a compression option. For example, in Python using h5py:

import h5py
f = h5py.File('my.h5', 'w')
d = f.create_dataset('a', dtype='f', shape=(512, 512, 512), fillvalue=-999.,
                     compression='gzip', compression_opts=9)
d[3, 4, 5] = 6
f.close()

The resulting file is 4.5 KB. Without compression, this same file would be about 512 MB (512³ four-byte floats). That's a compression ratio of about 99.999%, because most of the data are -999. (or whatever fillvalue you choose).


The equivalent can be achieved with the C++ HDF5 API by calling H5::DSetCreatPropList::setDeflate(9), with an example shown in h5group.cpp.
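To see the fill-value behaviour in action, the file can be read back: elements that were never written simply come back as the fillvalue, even though no space was spent storing them. A small self-contained sketch (same file name and values as the example above):

```python
import h5py

# Recreate the compressed dataset from the example, then read it back to
# show that unwritten elements return the fill value.
with h5py.File('my.h5', 'w') as f:
    d = f.create_dataset('a', dtype='f', shape=(512, 512, 512),
                         fillvalue=-999., compression='gzip',
                         compression_opts=9)
    d[3, 4, 5] = 6

with h5py.File('my.h5', 'r') as f:
    d = f['a']
    print(d[3, 4, 5])   # 6.0
    print(d[0, 0, 0])   # -999.0 -- never written, comes from fillvalue
```

Note that h5py automatically chunks the dataset when a compression filter is requested, which is what makes this work: only the chunks you actually touch get allocated in the file.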

Mike T
  • Although the analysis is indeed done in Python, the HDF5 file is generated in C++, so h5py is not an option. Is the same kind of compression supported natively by HDF5? I know PyTables and h5py support additional compression protocols. – andreabedini Sep 28 '14 at 06:19
    @andreabedini I've updated the answer with a link to a C++ example that does the same technique. I do believe that the dataset must be chunked to enable compression. – Mike T Sep 28 '14 at 20:58
  • From the [HDF5 link](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage) at the start of the answer: "Chunked storage makes it possible to resize datasets, and because the data is stored in fixed-size chunks, to use compression filters." So, yep, chunking required for compression. – hBy2Py Jul 06 '15 at 17:48
  • What does the `compression_opts` do in the above code? – Rama Apr 13 '17 at 10:04
    @Rama 9 is the maximum compression level; see [the docs](http://docs.h5py.org/en/latest/high/dataset.html#lossless-compression-filters) – Mike T Apr 14 '17 at 01:14

Chunked datasets (H5D_CHUNKED) allow sparse storage, but depending on your data, the overhead may be significant.

Take a typical array, try it both sparse and non-sparse, and compare the file sizes; then you will see whether it is really worth it.
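That comparison is easy to script. A minimal sketch in Python with h5py (the file names `dense.h5`/`sparse.h5`, the 64³ shape, and the 16³ chunk size are made up for illustration): both datasets get a single value written, but the contiguous one allocates the whole array on first write, while the chunked, compressed one only stores the touched chunk.

```python
import os
import h5py

shape = (64, 64, 64)

# Contiguous dataset: writing one value forces allocation of the full array.
with h5py.File('dense.h5', 'w') as f:
    d = f.create_dataset('a', shape=shape, dtype='f')
    d[3, 4, 5] = 6

# Chunked + compressed dataset: only the one touched chunk is stored.
with h5py.File('sparse.h5', 'w') as f:
    d = f.create_dataset('a', shape=shape, dtype='f',
                         chunks=(16, 16, 16),
                         compression='gzip', compression_opts=9)
    d[3, 4, 5] = 6

print(os.path.getsize('dense.h5'), os.path.getsize('sparse.h5'))
```

On a typical run the dense file is around 1 MB (64³ × 4 bytes plus metadata) while the sparse one is a few KB, so for truly sparse data the chunking overhead is easily worth it.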

Simon
    yes, [this](http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2010-March/002704.html) post explains how to do it (or perhaps how _not_ to do it) thanks – andreabedini Mar 02 '11 at 03:36

HDF5 provides indexed storage: http://www.hdfgroup.org/HDF5/doc/TechNotes/RawDStorage.html

Alexandre C.
  • hi, I'm not really familiar with how HDF5 works internally. How can I store raw data in an HDF5 file? Does that mean I can bypass the Table datatype and write my own structures? – andreabedini Mar 02 '11 at 03:41