I have a large NumPy float array (~4k x 16k, float64) that I want to store on disk. I am trying to understand the difference between the following two compression approaches:
1) Use np.save to write the .npy format, then gzip the result (as in one of the answers to Compress numpy arrays efficiently):
import gzip
import numpy

f = gzip.GzipFile("my_file.npy.gz", "wb")
numpy.save(f, my_array)
f.close()
I get an equivalent file size if I do the following instead:
import os
from subprocess import check_call

numpy.save('my_file', my_array)
check_call(['gzip', os.path.join(os.getcwd(), 'my_file.npy')])
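For completeness, a minimal round-trip sketch of approach 1), using a small hypothetical array in place of my_array: np.load reads straight from the gzip file object, and the .npy header restores dtype and shape automatically.

```python
import gzip
import numpy as np

# Hypothetical small array standing in for my_array from the question.
my_array = np.arange(12, dtype=np.float64).reshape(3, 4)

# Write the .npy payload through a gzip wrapper, as in approach 1).
with gzip.GzipFile("my_file.npy.gz", "wb") as f:
    np.save(f, my_array)

# Read it back; the .npy header carries dtype and shape.
with gzip.GzipFile("my_file.npy.gz", "rb") as f:
    restored = np.load(f)

assert restored.shape == (3, 4)
assert np.array_equal(restored, my_array)
```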
2) Write the array to a raw binary file using tofile(), close the file, then gzip that binary file:
import gzip
import shutil

f = open("my_file", "wb")
my_array.tofile(f)
f.close()

with open('my_file', 'rb') as f_in:
    with gzip.open('my_file.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
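A round-trip sketch of approach 2), again with a small hypothetical array in place of my_array. Unlike .npy, the raw stream carries no metadata, so the dtype and shape must be supplied by hand when reading back.

```python
import gzip
import shutil
import numpy as np

# Hypothetical small array standing in for my_array from the question.
my_array = np.arange(12, dtype=np.float64).reshape(3, 4)
my_array.tofile("my_file")

# Compress the raw binary file, as in approach 2).
with open("my_file", "rb") as f_in, gzip.open("my_file.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Reading back: dtype and shape are not stored, so they must be known.
with gzip.open("my_file.gz", "rb") as f:
    restored = np.frombuffer(f.read(), dtype=np.float64).reshape(3, 4)

assert np.array_equal(restored, my_array)
```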
The above is a workaround for the following code, which achieves no compression. That is expected: as the docs note, tofile() writes through the file's underlying descriptor and so bypasses the GzipFile wrapper entirely.
f = gzip.GzipFile("my_file_1.gz", "wb")
my_array.tofile(f)
f.close()
Here is my question: the file size from 1) is about 6 times smaller than that from 2). As I understand the .npy format, it is exactly the raw binary data plus a small header that preserves the dtype and array shape. I don't see any reason why the two file sizes should differ so drastically.
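To check that understanding concretely, a small in-memory experiment (with a hypothetical C-contiguous stand-in for my_array) comparing the .npy payload against the raw bytes:

```python
import gzip
import io
import numpy as np

# Hypothetical C-contiguous float64 stand-in for my_array.
my_array = np.random.default_rng(0).random((100, 200))

buf = io.BytesIO()
np.save(buf, my_array)
npy_bytes = buf.getvalue()
raw_bytes = my_array.tobytes()

# The .npy payload should be the raw bytes plus a small header,
# so the gzipped sizes of the same data should be nearly identical.
assert npy_bytes.endswith(raw_bytes)
header_overhead = len(npy_bytes) - len(raw_bytes)
print(header_overhead)
print(len(gzip.compress(npy_bytes)), len(gzip.compress(raw_bytes)))
```

If the two gzipped sizes come out nearly identical here, the 6x gap in the question must come from the data actually written in each case, not from the format itself.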