
I'm implementing a system where I want to do as little of the heavy mathematical lifting as possible.

I'm aware that there are issues with memoisation of numpy objects, and as such implemented a lazy-key cache to avoid the whole "premature optimisation" argument.
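The core problem is that arrays aren't hashable, so they can't be used as dict keys directly, hence the stringified key below:

>>> import numpy
>>> a = numpy.random.rand(10, 10)
>>> cache = {}
>>> cache[a] = 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'numpy.ndarray'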

def magic(self, numpyarg, intarg):
    # Lazy key: stringify the arguments and concatenate.
    key = str(numpyarg) + str(intarg)

    try:
        return self._cache[key]
    except KeyError:
        pass

    # ... here be dragons: compute `value` ...
    self._cache[key] = value
    return value

But string conversion takes quite a while:

>>> import timeit
>>> t = timeit.Timer("str(a)", "import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=100000)/100000   # seconds per call
0.00132

What do people suggest as 'the better way' to do this?

Bolster
  • possible duplicate of [How to hash a large object (dataset) in Python?](http://stackoverflow.com/questions/806151/how-to-hash-a-large-object-dataset-in-python) – tacaswell Mar 19 '14 at 16:43
  • Note that `str(a)` only shows part of the array, as later pointed out in this comment (illustrated just below): https://stackoverflow.com/questions/16589791/most-efficient-property-to-hash-for-numpy-array#comment23847098_16589791 – paperskilltrees Apr 11 '23 at 21:20
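To illustrate that point: with numpy's default print options, large arrays are summarised, so two arrays that differ only in the elided middle produce the same string, and hence the same cache key:

>>> import numpy
>>> a = numpy.zeros(2000)
>>> b = numpy.zeros(2000)
>>> b[1000] = 1.0
>>> str(a) == str(b)    # both print as '[0. 0. 0. ... 0. 0. 0.]'
True
>>> (a == b).all()
False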

3 Answers


Borrowed from this answer... so really I guess this is a duplicate:

>>> import hashlib
>>> import numpy
>>> import timeit
>>> a = numpy.random.rand(10, 100)
>>> b = a.view(numpy.uint8)
>>> hashlib.sha1(b).hexdigest()
'15c61fba5c969e5ed12cee619551881be908f11b'
>>> t = timeit.Timer("hashlib.sha1(a.view(numpy.uint8)).hexdigest()",
...                  "import hashlib; import numpy; a = numpy.random.rand(10,10)")
>>> t.timeit(number=10000)/10000
2.5790500640869139e-05
senderle
  • Nice! For multidimensional arrays this gives a different hash (for the "same" array) depending on whether it's fortran or c contiguous. If that's an issue, calling `np.ascontiguousarray` should solve it. – jorgeca Jan 27 '14 at 16:00
  • Not sure why a known slow hash function `sha1` is chosen. SHA-1 is OK for minimising hash collision but poor at speed. For speed you'll need something like `murmurhash` or `xxhash` (the latter claims to be even faster). – Cong Ma Aug 05 '15 at 10:06
  • @CongMa, thanks for the extra info. There are lots of options! But as you'll notice, this is already two orders of magnitude faster. And speed is never the _only_ concern. It's probably worth using a well-understood hash if the alternative is only a few millionths of a second faster. – senderle Aug 08 '15 at 11:24
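Building on jorgeca's comment, a sketch of a key helper that normalises memory layout before hashing (`array_digest` is an illustrative name, not from any library):

import hashlib
import numpy as np

def array_digest(a):
    # Force C-contiguous layout so C- and Fortran-ordered copies of the
    # same values produce identical digests.
    return hashlib.sha1(np.ascontiguousarray(a).view(np.uint8)).hexdigest()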

There is a package for this called joblib. I found it via this question.

from joblib import Memory
location = './cachedir'
memory = Memory(location)

# Create caching version of magic
magic_cached = memory.cache(magic)
result = magic_cached(...)

# Or (for one-time use)
result = memory.eval(magic, ...)
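Memory.cache also works as a decorator, which keeps call sites unchanged (a minimal sketch; the body here is only a placeholder for the real computation):

from joblib import Memory

memory = Memory('./cachedir', verbose=0)

@memory.cache
def magic(numpyarg, intarg):
    # Placeholder for the real "here be dragons" computation.
    return (numpyarg ** intarg).sum()

The first call computes and stores the result on disk; repeat calls with equal arguments (including equal array contents) are read back from the cache.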
John Salvatier
  • It would be better to have a quote from those links copied over into your answer, in case these websites go offline. – Alex Fortin Sep 16 '18 at 18:30

For small numpy arrays, this might also be suitable:

tuple(map(float, a))

if `a` is the numpy array. Note this only works for one-dimensional arrays, since `float()` fails on sub-arrays.
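For instance (a quick sketch with a 1-D array):

>>> import numpy
>>> a = numpy.random.rand(5)
>>> cache = {}
>>> cache[tuple(map(float, a))] = "expensive result"
>>> cache[tuple(map(float, a))]
'expensive result'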

Woltan