How to hash a large object (dataset) in Python?

Question

I would like to calculate a hash of a Python class containing a dataset for Machine Learning. The hash is meant to be used for caching, so I was thinking of md5 or sha1. The problem is that most of the data is stored in NumPy arrays; these do not provide a __hash__() member. Currently I do a pickle.dumps() for each member and calculate a hash based on these strings. However, I found the following links indicating that the same object could lead to different serialization strings:

What would be the best method to calculate a hash for a Python class containing NumPy arrays?

I'm not much of a seasoned Python programmer, but would serializing the object and hashing that work? — lsl, Apr 30 '09 at 09:57

score 34 · Answer 1 · answered Apr 30 '09 at 10:42

Thanks to John Montgomery I think I have found a solution, and I think it has less overhead than converting every number in possibly huge arrays to strings:

I can create a byte view of the arrays and use it to update the hash. This seems to give the same digest as updating directly with the array:

>>> import hashlib
>>> import numpy
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> print(a.dtype, b.dtype) # a and b have a different data type
float64 uint8
>>> hashlib.sha1(a).hexdigest() # array sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'
>>> hashlib.sha1(b).hexdigest() # byte view sha1
'794de7b1316b38d989a9040e6e26b9256ca3b5eb'

Will you be able to re-create the object from the cache with this technique? It seems like you will only be able to get an array of type uint8 back (sacrificing the accuracy in your array). – tgray Apr 30 '09 at 16:52
Using John Montgomery's solution, it looks like you would get back a float64 array. – tgray Apr 30 '09 at 17:00
@tgray: Sometimes it doesn't matter all that much what the accuracy is. Experimental data sets, especially large ones, tend to have large uncertainties anyway. Obviously this is subject to context, but the general rule is that double precision is important for the calculation, not for storing the data or the final answer. – Tim Lin Apr 30 '09 at 21:27

xioxox · Answer 2 · 2016-01-28T10:56:14.963

Using NumPy 1.10.1 and Python 2.7.6, you can now simply hash NumPy arrays using hashlib if the array is C-contiguous (use numpy.ascontiguousarray() if not), e.g.

>>> import hashlib
>>> import numpy
>>> h = hashlib.md5()
>>> arr = numpy.arange(101)
>>> h.update(arr)
>>> print(h.hexdigest())
e62b430ff0f714181a18ea1a821b0918
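
For the non-contiguous case mentioned above, a minimal sketch might look like the following. The transposed array is only an illustrative assumption, chosen because transposing a 2-D array gives a view that is not C-contiguous.

```python
import hashlib
import numpy

arr = numpy.arange(12).reshape(3, 4)
view = arr.T  # transposed view of the same data, not C-contiguous
print(view.flags["C_CONTIGUOUS"])

# Make a C-contiguous copy before handing the array to hashlib.
contiguous = numpy.ascontiguousarray(view)
digest = hashlib.md5(contiguous).hexdigest()
print(digest)
```

The digest should match what you get from hashing view.tobytes(), since both lay the elements out in C order.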

edited Jan 28 '16 at 10:56

answered Jan 28 '16 at 10:50


score 3 · Answer 3 · edited May 23 '17 at 11:46

There is a package for memoizing functions that take NumPy arrays as inputs: joblib. I found it via this question.
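
As a rough illustration (assuming joblib is installed), joblib also exposes a hash() helper that returns a hex digest for an object, including NumPy arrays. The array below is made up for the example, and this is only a sketch of how it could be used for cache keys, not a description of how the asker's cache should be built.

```python
import numpy
import joblib

a = numpy.arange(10, dtype=numpy.float64)

# joblib.hash returns a hex digest string computed from the array.
digest_a = joblib.hash(a)
digest_copy = joblib.hash(a.copy())   # same contents, different object
digest_changed = joblib.hash(a + 1)   # different contents

print(digest_a == digest_copy, digest_a == digest_changed)
```

For memoizing whole functions, joblib's Memory class is the piece to look at.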

edited May 23 '17 at 11:46

Community

answered Mar 28 '11 at 21:27

John Salvatier

score 3 · Answer 4 · edited Apr 30 '09 at 10:24

What's the format of the data in the arrays? Couldn't you just iterate through the arrays, convert them into a string (via some reproducible means) and then feed that into your hash via update?

e.g.

import hashlib
m = hashlib.md5() # or sha1 etc
for value in array: # array contains the data
    # update() needs bytes on Python 3, so encode the string form
    m.update(str(value).encode("utf-8"))

Don't forget, though, that NumPy arrays don't provide __hash__() because they are mutable. So be careful not to modify the arrays after you have calculated your hash (as it will no longer be the same).
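
One thing to watch with the string approach is the "reproducible means" part. As a small sketch (the array sizes are arbitrary), with NumPy's default print options str() abbreviates long arrays with "...", so two different arrays can produce the same text. This matters if the loop ends up converting whole rows rather than single values.

```python
import numpy

a = numpy.zeros(2000)
b = a.copy()
b[1000] = 1.0  # change one element in the middle

# Long arrays are summarized when converted to text, so the middle
# element never appears in either string.
same_text = str(a) == str(b)
same_bytes = a.tobytes() == b.tobytes()
print(same_text, same_bytes)
```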

score 2 · Answer 5 · answered Mar 28 '11 at 19:14

Here is how I do it in jug (git HEAD at the time of this answer):

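import hashlib
import pickle

e = some_array_object  # placeholder for any numpy.ndarray
M = hashlib.md5()
M.update(b'np.ndarray')  # bytes literal, as required on Python 3
M.update(pickle.dumps(e.dtype))
M.update(pickle.dumps(e.shape))
try:
    buffer = e.data
    M.update(buffer)
except Exception:
    # non-contiguous arrays: hash a contiguous copy instead
    M.update(e.copy().data)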

The try/except is needed because e.data is only usable for some arrays (contiguous arrays). The same applies to a.view(np.uint8), which fails with a non-descriptive type error if the array is not contiguous.
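
Applied to the original question (one hash for an object holding several arrays), the same idea could be sketched roughly as below. The helper names update_with_array and hash_arrays are hypothetical, the dtype and shape are encoded as text instead of being pickled (a swap I made for the sketch, not what jug does), and plain numeric dtypes are assumed.

```python
import hashlib
import numpy


def update_with_array(hasher, arr):
    """Feed dtype, shape and raw bytes of one array into an existing hasher."""
    arr = numpy.ascontiguousarray(arr)  # copies only if not C-contiguous
    hasher.update(b"np.ndarray")
    hasher.update(arr.dtype.str.encode("ascii"))    # e.g. '<f8', includes byte order
    hasher.update(repr(arr.shape).encode("ascii"))
    hasher.update(arr)


def hash_arrays(named_arrays):
    """Return one hex digest for a dict mapping names to arrays (hypothetical helper)."""
    hasher = hashlib.sha1()
    for name in sorted(named_arrays):  # fixed order keeps the digest stable
        hasher.update(name.encode("utf-8"))
        update_with_array(hasher, named_arrays[name])
    return hasher.hexdigest()


dataset = {
    "features": numpy.arange(6.0).reshape(2, 3),
    "labels": numpy.array([0, 1]),
}
print(hash_arrays(dataset))
```

Including the shape and dtype means a reshaped or re-typed array gets a different digest even when the underlying bytes are identical.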

score 1 · Answer 6 · edited Feb 03 '23 at 11:27

Fastest by some margin seems to be:

>>> hash(iter(a))

a is a numpy ndarray.

Obviously not secure hashing, but it should be good for caching etc.
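
A caveat worth checking before relying on this: as far as I can tell, iter(a) returns a plain iterator object, and its hash comes from the object's identity rather than from the array's contents. That would explain the speed, but it also means the value is not a function of the data. A small sketch, with a content-based digest shown for comparison:

```python
import hashlib
import numpy

a = numpy.arange(5)

# Two live iterators over the same array are distinct objects,
# so their hashes differ even though the data is the same.
it1 = iter(a)
it2 = iter(a)
iterator_hashes_equal = hash(it1) == hash(it2)

# A digest of the raw bytes depends only on the contents.
digest_a = hashlib.sha1(a.tobytes()).hexdigest()
digest_copy = hashlib.sha1(a.copy().tobytes()).hexdigest()

print(iterator_hashes_equal, digest_a == digest_copy)
```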

edited Feb 03 '23 at 11:27

tshepang

answered Oct 18 '13 at 19:08

user2896082

Under python 3.3+ you won't get the same hash values on different runs of python, due to security improvements. – xioxox Jan 28 '16 at 11:01

score 0 · Answer 7 · answered Aug 22 '09 at 08:29

array.data is always hashable, because it's a buffer object. Easy :) That is, unless you care about the difference between differently shaped arrays with the exact same data, etc. (i.e. this is suitable unless shape, byte order, and other array 'parameters' must also figure into the hash).
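
To make the shape caveat concrete, here is a small sketch with made-up arrays. It hashes the raw bytes via tobytes() rather than array.data, which is my substitution to keep the example portable across NumPy versions.

```python
import hashlib
import numpy

flat = numpy.arange(6, dtype=numpy.int64)
grid = flat.reshape(2, 3)  # same bytes, different shape

zeros64 = numpy.zeros(2, dtype=numpy.float64)  # 16 zero bytes
zeros32 = numpy.zeros(4, dtype=numpy.float32)  # also 16 zero bytes


def bytes_digest(arr):
    # Hash only the raw data, ignoring shape and dtype.
    return hashlib.sha1(arr.tobytes()).hexdigest()


shape_collision = bytes_digest(flat) == bytes_digest(grid)
dtype_collision = bytes_digest(zeros64) == bytes_digest(zeros32)
print(shape_collision, dtype_collision)
```

Whether such collisions matter depends on what the cache key needs to distinguish.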

array.data is not hashable as of numpy 1.6.2 and python 2.7 – yoavram May 09 '13 at 13:38
