
I was surprised to find that if you save the same numpy object to file using numpy.savez, the file created is not deterministic. For example,

import numpy
x = numpy.random.rand(1000, 1000)
numpy.savez('foo.npz', x)
numpy.savez('bar.npz', x)

And then

md5sum foo.npz bar.npz

d1b8b7d2000055b8bf62dddc4a5c77b5  foo.npz
1c6e13bb9efca3ec144e81b88b6cdc75  bar.npz

Reading this, it looks like it has something to do with the timestamps stored inside the npz zip archive.

For testing purposes, I want to verify that the data files that my code creates are identical. I usually do this with a checksum on pickle files, e.g.

import cPickle as pickle
with open('foo.pkl', 'wb') as f:
    pickle.dump(x, f, protocol=2)

with open('bar.pkl', 'wb') as f:
    pickle.dump(x, f, protocol=2)

And then

 md5sum foo.pkl bar.pkl
 3139d9142d57bdde0970013f39b4854f  foo.pkl
 3139d9142d57bdde0970013f39b4854f  bar.pkl

Is there any workaround for doing the same thing with numpy.savez?

mjandrews
  • With a single array you don't need to use 'savez'. np.save works just as well. In fact it's what 'pickle' uses. – hpaulj Nov 07 '17 at 22:11
  • @hpaulj Thanks. I didn't know that. But in my use case I am actually doing the likes of `numpy.savez('foo.npz', x, y ... z)`, i.e. saving many objects to one npz file. And also, I just checked, and it seems like `numpy.save` is also creating zip files, and in any case, has the same problem mentioned above with `savez`. – mjandrews Nov 07 '17 at 22:17

2 Answers


If you're not passing keyword arguments to np.savez (i.e. you only want to serialize your data, without needing to reference items later by key), you can get away with dumping multiple arrays into the same file with np.save:

import numpy as np
import time

def mysavez(outfile, *args):
    # dump each array sequentially into one raw file with np.save
    with open(outfile, 'wb') as outf:
        for arg in args:
            np.save(outf, arg)

x = np.random.rand(1000,1000)

# control group
np.savez('foo.npz', *[x]*5)
time.sleep(2) # make sure there's a difference in timestamp
np.savez('bar.npz', *[x]*5)

# new one
mysavez('foo.nopz', *[x]*5)
time.sleep(2) # make sure there's a difference in timestamp
mysavez('bar.nopz', *[x]*5)

The resulting new files have the same hash, and they even have the exact same size as the originals:

$ md5sum foo.npz bar.npz
4d21c47903b4ffab945f619ad5b6f471  foo.npz
f9af863c6178765d6dc32a5fa2f63623  bar.npz
$ md5sum foo.nopz bar.nopz
c8504f0d8cc53956100912efb02573b0  foo.nopz
c8504f0d8cc53956100912efb02573b0  bar.nopz
$ du {foo,bar}.n*pz
39064   foo.nopz
39064   foo.npz
39064   bar.nopz
39064   bar.npz

As long as you're sequentially reading variables from the file you won't notice a functional difference. Of course you'll need a myload to go with it that yields the saved arrays until they're all gone (or be extra fancy and save an initial integer header telling you the number of arrays saved to the file). This approach is admittedly kludgy, but it might cut it depending on your exact use case.
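A minimal sketch of such a myload (the name and the end-of-file check via the file size are my own choices, not part of any numpy API):

```python
import os
import numpy as np

def myload(infile):
    # hypothetical counterpart to mysavez: yield the saved arrays
    # in order until the file is exhausted
    size = os.path.getsize(infile)
    with open(infile, 'rb') as inf:
        while inf.tell() < size:
            yield np.load(inf)
```

With that, `list(myload('foo.nopz'))` returns the arrays in the order they were saved.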

If you do want to access your saved variables by key, you could still write an auxiliary function for testing: read the "production" .npz files, iterate over their sorted keys, save the arrays sequentially using the above mysavez, then compute the hash of these "flattened" files. You might not even need np.save for this: pickle can do the same for you (although cPickle might not).
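As a sketch of that auxiliary function (the name npz_flat_hash and the second file path are hypothetical; `np.load(...).files` is the real attribute listing the stored keys):

```python
import hashlib
import numpy as np

def npz_flat_hash(npz_path, flat_path):
    # re-save the arrays from an .npz sequentially, sorted by key,
    # then hash the resulting timestamp-free flat file
    with np.load(npz_path) as data:
        arrays = [data[k] for k in sorted(data.files)]
    with open(flat_path, 'wb') as f:
        for arr in arrays:
            np.save(f, arr)
    h = hashlib.md5()
    with open(flat_path, 'rb') as f:
        h.update(f.read())
    return h.hexdigest()
```

Two .npz files written from the same data then hash identically via their flattened copies, regardless of the zip timestamps.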


This can be achieved with numpy records/structured arrays, with the only limitation that the first dimension of every array must be the same.

Below is a code snippet that makes it work with numpy records, so you can even use kwargs.

import numpy as np
import time
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def mysavez(outfile, **kwargs):
    # sort the keys so the field order is deterministic
    _sorted_keys = sorted(kwargs.keys())

    # length of the first element ... and check that it is the same for all elements
    _len = kwargs[_sorted_keys[0]].shape[0]
    for k, v in kwargs.items():
        if v.shape[0] != _len:
            raise ValueError(
                f"While creating a numpy struct all arrays must have the same length; "
                f"invalid shape {v.shape} for item {k}"
            )

    # create numpy record buffer
    npy_record = np.zeros(
        _len,
        dtype=[(k, kwargs[k].dtype, kwargs[k].shape[1:]) for k in _sorted_keys],
    )

    # fill up the elements
    for k, v in kwargs.items():
        npy_record[k] = v

    # save
    with open(outfile, 'wb') as outf:
        np.save(outf, npy_record)

a = np.random.rand(1000,1000)
b = np.random.rand(1000,1000)

# new one
mysavez('foo.nopz', a=a, b=b)
time.sleep(2) # make sure there's a difference in timestamp
mysavez('bar.nopz', a=a, b=b)

# check hash
print('foo.nopz', md5('foo.nopz'))
print('bar.nopz', md5('bar.nopz'))

Output:

foo.nopz b7759b6a60f135c393954e530fb5604b
bar.nopz b7759b6a60f135c393954e530fb5604b
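Reading the record back is a plain np.load, after which each array is available under its original keyword, much like an .npz archive. A self-contained sketch with small made-up arrays (the file name tiny.nopz and the field layout are illustrative only):

```python
import numpy as np

# build a tiny structured record, the same kind of layout mysavez produces
rec = np.zeros(3, dtype=[('a', np.float64, (2,)), ('b', np.int64)])
rec['a'] = np.arange(6.0).reshape(3, 2)
rec['b'] = [1, 2, 3]

with open('tiny.nopz', 'wb') as f:
    np.save(f, rec)

with open('tiny.nopz', 'rb') as f:
    loaded = np.load(f)

# fields come back by key
assert (loaded['a'] == rec['a']).all()
assert (loaded['b'] == rec['b']).all()
```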
Praveen Kulkarni