
I have HDF5 files containing numpy arrays of shape (N, M, Q), where N is the number of matrices. Their key property is that every value is a power of two and values repeat a lot. Concretely, the data looks like this:

[[0,2,4,16,1024], [2,4,16,512,128], [4,16,128,0,2048] ...]

I'm looking for good compression. I tested gzip and bzip2, but they don't seem to be a good choice in this case. It seems I need a compressor with a custom dictionary, or something that can really exploit the structure of such datasets. I don't have a good grasp of HDF5 filters and compressors, so I decided to ask here while I keep reading up on the topic.
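One direction worth considering (a sketch, not tied to any particular HDF5 filter): since every entry is 0 or a power of two, each value can be replaced by its exponent plus a sentinel for 0, so even 2048 (which needs 12 bits) shrinks to a single byte before the general-purpose compressor ever sees it. The `encode`/`decode` helpers below are hypothetical names for illustration:

```python
# Pre-filter idea: map 0 -> 0 and 2**k -> k + 1, so every entry fits
# in a uint8 as long as values stay below 2**255. The compressor then
# works on a much smaller, lower-entropy byte stream.

def encode(value):
    """Replace a value (0 or a power of two) by a one-byte code."""
    if value == 0:
        return 0
    assert value & (value - 1) == 0, "not a power of two"
    return value.bit_length()  # 2**k has bit_length k + 1

def decode(code):
    """Invert encode(): 0 -> 0, k + 1 -> 2**k."""
    return 0 if code == 0 else 1 << (code - 1)

row = [0, 2, 4, 16, 1024]
encoded = [encode(v) for v in row]  # -> [0, 2, 3, 5, 11]
assert [decode(c) for c in encoded] == row
```

In numpy this would be a vectorized operation (e.g. via `np.log2` on the nonzero entries, cast to `uint8`), and the resulting array can then be stored with any HDF5 compression filter.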

If you have any experience or any ideas/recommendations, I'll be very thankful for your help!

Thanks in advance!

kitsune_breeze
  • Which compression ratio do you want to achieve? You can for example try https://stackoverflow.com/a/56761075/4045774 with clevel=9 (BLOSC is also available in h5py via a workaround) Of course chunk-shape has also an influence on the result. – max9111 Mar 24 '21 at 14:17
  • @max9111, hi, I want to reach a compression ratio of about 20 or more; at this moment the best result is 12 (bzip2, bitshuffle). I decided to test all possible compressions, but I am still far from the desired compression ratio. – kitsune_breeze Mar 25 '21 at 12:05

0 Answers