
I'm implementing a system where I want to do as little of the heavy mathematical lifting as possible.

I'm aware that there are issues with memoisation of numpy objects, and as such implemented a lazy-key cache to avoid the whole "premature optimisation" argument.
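The core problem is that arrays aren't hashable, so they can't be used as dict keys directly, hence the stringified key below:

>>> import numpy
>>> a = numpy.random.rand(10, 10)
>>> cache = {}
>>> cache[a] = 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'numpy.ndarray'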

def magic(self, numpyarg, intarg):
    # Lazy key: stringify the arguments and concatenate.
    key = str(numpyarg) + str(intarg)

    try:
        return self._cache[key]
    except KeyError:
        pass

    # ... here be dragons: compute `value` ...
    self._cache[key] = value
    return value

But string conversion takes quite a while:

>>> import timeit
>>> t = timeit.Timer("str(a)", "import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=100000)/100000   # seconds per call
0.00132

What do people suggest as 'the better way' to do this?

Bolster
  • possible duplicate of [How to hash a large object (dataset) in Python?](http://stackoverflow.com/questions/806151/how-to-hash-a-large-object-dataset-in-python) – tacaswell Mar 19 '14 at 16:43
  • Note that `str(a)` only shows part of the array, as later pointed out in this comment (illustrated just below): https://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array#comment23847098_16589791 – paperskilltrees Apr 11 '23 at 21:20
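To illustrate that point: with numpy's default print options, large arrays are summarised, so two arrays that differ only in the elided middle produce the same string, and hence the same cache key:

>>> import numpy
>>> a = numpy.zeros(2000)
>>> b = numpy.zeros(2000)
>>> b[1000] = 1.0
>>> str(a) == str(b)    # both print as '[0. 0. 0. ... 0. 0. 0.]'
True
>>> (a == b).all()
False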

3 Answers


Borrowed from this answer... so really I guess this is a duplicate:

>>> import hashlib
>>> import numpy
>>> import timeit
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> hashlib.sha1(b).hexdigest()
'15c61fba5c969e5ed12cee619551881be908f11b'
>>> t = timeit.Timer("hashlib.sha1(a.view(numpy.uint8)).hexdigest()",
...                  "import hashlib; import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=10000)/10000
2.5790500640869139e-05
senderle
  • Nice! For multidimensional arrays this gives a different hash (for the "same" array) depending on whether it's fortran or c contiguous. If that's an issue, calling `np.ascontiguousarray` should solve it. – jorgeca Jan 27 '14 at 16:00
  • Not sure why a known slow hash function `sha1` is chosen. SHA-1 is OK for minimising hash collision but poor at speed. For speed you'll need something like `murmurhash` or `xxhash` (the latter claims to be even faster). – Cong Ma Aug 05 '15 at 10:06
  • @CongMa, thanks for the extra info. There are lots of options! But as you'll notice, this is already two orders of magnitude faster. And speed is never the _only_ concern. It's probably worth using a well-understood hash if the alternative is only a few millionths of a second faster. – senderle Aug 08 '15 at 11:24
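Building on jorgeca's comment, a sketch of a key helper that normalises memory layout before hashing (`array_digest` is an illustrative name, not from any library):

import hashlib
import numpy as np

def array_digest(a):
    # Force C-contiguous layout so C- and Fortran-ordered copies of the
    # same values produce identical digests.
    return hashlib.sha1(np.ascontiguousarray(a).view(np.uint8)).hexdigest()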

There is a package for this called joblib. I found it via this question.

from joblib import Memory
location = './cachedir'
memory = Memory(location)

# Create caching version of magic
magic_cached = memory.cache(magic)
result = magic_cached(...)

# Or (for one-time use)
result = memory.eval(magic, ...)
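Memory.cache also works as a decorator, which keeps call sites unchanged (a minimal sketch; the body here is only a placeholder for the real computation):

from joblib import Memory

memory = Memory('./cachedir', verbose=0)

@memory.cache
def magic(numpyarg, intarg):
    # Placeholder for the real "here be dragons" computation.
    return (numpyarg ** intarg).sum()

The first call computes and stores the result on disk; repeat calls with equal arguments (including equal array contents) are read back from the cache.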
John Salvatier
  • It would be better to have a quote from those links copied over into your answer, in case these websites go offline. – Alex Fortin Sep 16 '18 at 18:30

For small numpy arrays, this might also be suitable:

tuple(map(float, a))

if `a` is the numpy array. Note this only works for one-dimensional arrays, since `float()` fails on sub-arrays.
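For instance (a quick sketch with a 1-D array):

>>> import numpy
>>> a = numpy.random.rand(5)
>>> cache = {}
>>> cache[tuple(map(float, a))] = "expensive result"
>>> cache[tuple(map(float, a))]
'expensive result'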

Woltan