I have a large NumPy float array (~4k x 16k, float64) that I want to store on disk. I am trying to understand the difference between the following two compression approaches:
1) Use np.save to write the .npy format, then gzip the result (as in one of the answers to Compress numpy arrays efficiently):
import gzip
import numpy

f = gzip.GzipFile("my_file.npy.gz", "wb")
numpy.save(f, my_array)
f.close()
I get an equivalent file size if I do the following instead:
import os
from subprocess import check_call

numpy.save('my_file', my_array)
check_call(['gzip', os.path.join(os.getcwd(), 'my_file.npy')])
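For completeness, a minimal round-trip sketch of approach 1), using a small hypothetical array in place of my_array: np.load reads straight from the gzip file object, and the .npy header restores dtype and shape automatically.

```python
import gzip
import numpy as np

# Hypothetical small array standing in for my_array from the question.
my_array = np.arange(12, dtype=np.float64).reshape(3, 4)

# Write the .npy payload through a gzip wrapper, as in approach 1).
with gzip.GzipFile("my_file.npy.gz", "wb") as f:
    np.save(f, my_array)

# Read it back; the .npy header carries dtype and shape.
with gzip.GzipFile("my_file.npy.gz", "rb") as f:
    restored = np.load(f)

assert restored.shape == (3, 4)
assert np.array_equal(restored, my_array)
```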
2) Write the array to a raw binary file using tofile(), close the file, then gzip that binary file:
import gzip
import shutil

f = open("my_file", "wb")
my_array.tofile(f)
f.close()

with open('my_file', 'rb') as f_in:
    with gzip.open('my_file.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
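A round-trip sketch of approach 2), again with a small hypothetical array in place of my_array. Unlike .npy, the raw stream carries no metadata, so the dtype and shape must be supplied by hand when reading back.

```python
import gzip
import shutil
import numpy as np

# Hypothetical small array standing in for my_array from the question.
my_array = np.arange(12, dtype=np.float64).reshape(3, 4)
my_array.tofile("my_file")

# Compress the raw binary file, as in approach 2).
with open("my_file", "rb") as f_in, gzip.open("my_file.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Reading back: dtype and shape are not stored, so they must be known.
with gzip.open("my_file.gz", "rb") as f:
    restored = np.frombuffer(f.read(), dtype=np.float64).reshape(3, 4)

assert np.array_equal(restored, my_array)
```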
The above is a workaround for the following code, which achieves no compression. That is expected: as the docs note, tofile() writes through the file's underlying descriptor and so bypasses the GzipFile wrapper entirely.
f = gzip.GzipFile("my_file_1.gz", "wb")
my_array.tofile(f)
f.close()
Here is my question: the file size from 1) is about 6 times smaller than that from 2). As I understand the .npy format, it is exactly the raw binary data plus a small header that preserves the dtype and array shape. I don't see any reason why the two file sizes should differ so drastically.
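To check that understanding concretely, a small in-memory experiment (with a hypothetical C-contiguous stand-in for my_array) comparing the .npy payload against the raw bytes:

```python
import gzip
import io
import numpy as np

# Hypothetical C-contiguous float64 stand-in for my_array.
my_array = np.random.default_rng(0).random((100, 200))

buf = io.BytesIO()
np.save(buf, my_array)
npy_bytes = buf.getvalue()
raw_bytes = my_array.tobytes()

# The .npy payload should be the raw bytes plus a small header,
# so the gzipped sizes of the same data should be nearly identical.
assert npy_bytes.endswith(raw_bytes)
header_overhead = len(npy_bytes) - len(raw_bytes)
print(header_overhead)
print(len(gzip.compress(npy_bytes)), len(gzip.compress(raw_bytes)))
```

If the two gzipped sizes come out nearly identical here, the 6x gap in the question must come from the data actually written in each case, not from the format itself.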