Just adding an alternative that easily gave me the highest compression ratio and, on top of that, did it so fast I was sure I had made a mistake somewhere (I hadn't). The real bonus is that decompression is also very fast, so any program that reads in lots of preprocessed data, for example, will benefit hugely from this. One potential caveat is that the blosc docs mention "small arrays (<2GB)", but it looks like there are ways around that. Or, if you're lazy like me, breaking up your data is usually an option (there's a quick sketch of that after the read example below).
Some smart cookies came up with python-blosc. It's a "high performance compressor", according to their docs. I was led to it from an answer to this question.
Once installed via, e.g., pip install blosc or conda install python-blosc, you can compress pickled data pretty easily as follows:
import blosc
import numpy as np
import pickle
data = np.random.rand(3, 3, int(1e7))  # array dimensions must be ints
pickled_data = pickle.dumps(data)  # returns the data as a bytes object
compressed_pickle = blosc.compress(pickled_data)
with open("path/to/file/test.dat", "wb") as f:
    f.write(compressed_pickle)
And to read it:
with open("path/to/file/test.dat", "rb") as f:
    compressed_pickle = f.read()
depressed_pickle = blosc.decompress(compressed_pickle)
data = pickle.loads(depressed_pickle) # turn bytes object back into data
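As an aside, about that "<2GB" caveat from the top: blosc compresses a single buffer at a time, and the buffer has to stay under roughly 2 GB. The lazy workaround I mentioned is to split the pickled bytes into pieces and compress each piece separately. Here is a minimal sketch of that idea; the chunk size, the length-prefix file layout, and the helper names write_compressed / read_compressed are my own choices, not anything from the blosc docs:

import pickle
import blosc

CHUNK = 2**30  # 1 GiB per piece, comfortably under blosc's ~2 GB buffer limit

def write_compressed(path, obj):
    raw = pickle.dumps(obj)
    with open(path, "wb") as f:
        for i in range(0, len(raw), CHUNK):
            comp = blosc.compress(raw[i:i + CHUNK])
            f.write(len(comp).to_bytes(8, "little"))  # length prefix so we can read it back
            f.write(comp)

def read_compressed(path):
    parts = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:  # end of file
                break
            n = int.from_bytes(header, "little")
            parts.append(blosc.decompress(f.read(n)))
    return pickle.loads(b"".join(parts))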
I'm using Python 3.7, and without even looking at the different compression options I got a compression ratio of about 12; reading + decompressing + loading the compressed pickle file took only a fraction of a second longer than loading the uncompressed pickle file.
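If you do want to poke at those options, blosc.compress also takes clevel, shuffle and cname arguments (at least in recent python-blosc versions), so you can compare codecs and measure the ratio yourself. A quick sketch, with a smaller array so it runs fast; which codecs are actually available depends on how your blosc was built, and the ratio you get will depend entirely on your own data:

import pickle
import blosc
import numpy as np

data = np.random.rand(3, 3, int(1e6))
pickled_data = pickle.dumps(data)

# try a few codecs; typesize=8 matches float64 elements
for cname in ("blosclz", "lz4", "zstd"):
    compressed = blosc.compress(pickled_data, typesize=8, clevel=9, cname=cname)
    print(cname, "ratio:", round(len(pickled_data) / len(compressed), 2))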
I wrote this more as a reference for myself, but I hope someone else will find this useful.
Peace oot