
Which compression algorithms are typically used to compress HDF5 files that were created without applying any HDF5 compression filters?

My HDF5 files are created with h5py in Python 3.8 and contain N-dimensional numpy arrays of 32-bit floats ranging between -1.0 and 1.0, with shapes similar to (1000000,10,200). Data is read as (1,10,200) arrays in a random pattern. Chunking the HDF5 datasets appeared to make non-contiguous/random reads significantly slower, so chunking was disabled, which in turn prevented the use of HDF5 compression filters. A minimal sketch of this setup is shown below.
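
A minimal sketch of the setup described above, assuming h5py and numpy; the file name `data.h5` and dataset name `arr` are placeholders, not names from my actual code:

```python
import h5py
import numpy as np

# Write: contiguous layout (chunks left at the default of None).
# This avoids chunk overhead on random reads, but rules out HDF5 compression filters.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("arr", shape=(1_000_000, 10, 200), dtype="float32")

# Read: one (1, 10, 200) slice at a time, at a random position along the first axis.
with h5py.File("data.h5", "r") as f:
    dset = f["arr"]
    idx = np.random.randint(0, dset.shape[0])
    sample = dset[idx:idx + 1]   # numpy array of shape (1, 10, 200)
```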

Athena Wisdom
  • What benchmarks have you performed to verify that the chunking is actually impacting I/O performance? If it is, is it actually significant enough to warrant requiring contiguous data over chunked? Your compression method will largely depend on the nature of the underlying data. Can you provide more details about it? – ESilk Jun 23 '20 at 17:07
  • @esilk No comprehensive benchmarks were performed. Current dataset is a (1000000,10,200) numpy array containing float32 which is being read (1,10,200) at a time in a random pattern. Setting the chunk size to (1,10,200) made the random reads 5-10X slower. Updated question with these details – Athena Wisdom Jun 23 '20 at 17:13
  • I should clarify what I meant about the nature of the data -- I don't just mean its shape and representation. I assume the data isn't just toy data, and is instead measurements of some phenomena. Are there well known methods for compressing it within the domain (JPEG for images, FLAC for audio, etc.)? If not, are there patterns/structures in the data that can be exploited to reduce the representation? – ESilk Jun 23 '20 at 17:23
  • @esilk Thanks, the data has no domain specific compressions that I know of. The arrays contain data that have been calculated off the original measurement data. The original data has been excluded from this dataset. – Athena Wisdom Jun 23 '20 at 17:33
  • I'm not sure I can make many suggestions, then. Compression tends to be specific to the nature of the data, and trying arbitrary techniques may result in no major gains (or even losses): https://en.wikipedia.org/wiki/Lossless_compression#Limitations – ESilk Jun 23 '20 at 17:39
  • There is a nice discussion here: https://stackoverflow.com/a/48405220/10462884. See comments about using `h5py_cache`. More on `h5py_cache` here: https://stackoverflow.com/a/44961222/10462884. Have you considered PyTables (aka tables module)? They have an excellent discussion on optimization, chunking, compression algorithms here: https://www.pytables.org/usersguide/optimization.html. It's worth a read. – kcw78 Jun 23 '20 at 19:55 (a chunk-cache sketch follows these comments)
  • Add an example of what you have tried regarding chunking and how you write the dataset (it may be fragmented on an HDD). Is the access pattern really completely random, or only "somehow" random? – max9111 Jun 25 '20 at 11:35
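
Following up on the chunking and chunk-cache suggestions in the comments above, a hedged sketch of what that alternative might look like. It assumes h5py >= 2.9, where the raw-data chunk cache can be configured directly through `h5py.File` keyword arguments (so the older `h5py_cache` package is not needed); the file and dataset names, cache sizes, and the `lzf` filter choice are illustrative assumptions, not tested recommendations:

```python
import h5py
import numpy as np

# Write a chunked, compressed copy; one chunk per read unit.
with h5py.File("data_chunked.h5", "w") as f:
    f.create_dataset(
        "arr",
        shape=(1_000_000, 10, 200),
        dtype="float32",
        chunks=(1, 10, 200),     # matches the (1, 10, 200) read pattern
        compression="lzf",       # fast built-in filter; gzip is another option
    )

# Read with an enlarged chunk cache; each chunk here is 10 * 200 * 4 = 8000 bytes.
with h5py.File(
    "data_chunked.h5",
    "r",
    rdcc_nbytes=64 * 1024**2,    # 64 MiB chunk cache (default is 1 MiB)
    rdcc_nslots=999_983,         # a prime, well above the number of chunks that fit in the cache
) as f:
    dset = f["arr"]
    idx = np.random.randint(0, dset.shape[0])
    sample = dset[idx:idx + 1]   # (1, 10, 200) slice, decompressed on read
```

Whether this helps depends on the access pattern: a cache only pays off if some chunks are revisited, so benchmarking against the contiguous layout (as suggested in the first comment) would still be needed.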

0 Answers