92

I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.

The array represents indices, so while the actual identity of the object is not important, the value is. Mutability is not a concern, as I'm only interested in the current value.

What should I hash in order to store it in a dict?

My current approach is to use `str(arr.data)`, which is faster than `md5` in my testing.


I've incorporated some examples from the answers to get an idea of relative times:

In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop

In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop

In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop

In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop

In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop

It would appear that for this particular use case (small arrays of indices), `arr.tostring()` offers the best performance.

While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
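
For reference, a quick illustration of why hashing the contents (rather than the object identity) does what's wanted here (a sketch; `y` stands for a small index array like those timed above, and `tobytes()` is the modern name for `tostring()`):

>>> import numpy as np
>>> y = np.array([3, 1, 2])
>>> z = y.copy()                               # distinct object, same values
>>> hash(y.tostring()) == hash(z.tostring())   # equal bytes give an equal hash
True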

sapi
  • 2
    `arr.tostring()` does the same and is more aesthetically pleasing. If you have really big arrays you could try stringifying only a small part of the array. – root May 16 '13 at 14:45
  • 1
    `tostring` also appears to be orders of magnitude faster for small arrays (although 4× slower for an array of 10000 elements). – Fred Foo May 16 '13 at 15:49
  • 14
    ... which is actually quite obvious, because `str` only formats the head and tail of the array. – Fred Foo May 16 '13 at 15:56
  • Am I mistaken or is `str(arr.data)` simply wrong? I used this on different arrays and got the same strings back...!? – Make42 Feb 06 '21 at 17:50

5 Answers

72

You can simply hash the underlying buffer, if you make it read-only:

>>> import numpy as np
>>> a = np.random.randint(10, 100, 100000)
>>> a.flags.writeable = False
>>> %timeit hash(a.data)
100 loops, best of 3: 2.01 ms per loop
>>> %timeit hash(a.tostring())
100 loops, best of 3: 2.28 ms per loop

For very large arrays, `hash(str(a))` is a lot faster, but then it only takes a small part of the array into account.

>>> %timeit hash(str(a))
10000 loops, best of 3: 55.5 us per loop
>>> str(a)
'[63 30 33 ..., 96 25 60]'
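
A quick sketch of that hazard (added for illustration; numpy abbreviates the printed form of large arrays, so two arrays that differ only in the middle share a string, and hence a hash):

>>> b = a.copy()
>>> b[50000] += 1                     # differs from a only in the middle
>>> str(a) == str(b)                  # the abbreviated strings are identical
True
>>> hash(str(a)) == hash(str(b))      # ...so the hashes collide
True
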
Fred Foo
  • Thanks. I'm going to use `tostring` for now, but I might investigate changing my input arguments a little so that I can make use of read-only buffers the whole way through, making hash faster. – sapi May 16 '13 at 21:13
  • 54
In Python 3.4 I found I had to use `hash(a.data.tobytes())` – ariddell Oct 16 '14 at 02:41
  • 4
Sorry for coming to this kind of late, but using `hash(a.data.tobytes())` as @ariddell suggested means I don't have to set `a.flags.writeable = False`. Any reason for this and any potential problems in doing so? – SCB Sep 30 '15 at 09:11
  • 3
@SCB `tobytes()` makes a complete copy of the underlying data (which could be time-consuming for huge arrays) and returns a bytes object which can then be hashed. So if you modify `a` afterwards, you will need to call `a.data.tobytes()` again, wasting more time/memory. In Python 2, you can use `hash(a.data)`, which doesn't take a copy of the data, and the hash of it is cached so that future calls to `hash(a.data)` return instantly. But that's the problem: changing `a` doesn't clear the cache, so the hash is incorrect if `a` is changed. I'm not sure if a no-copy hash is possible at all in Python 3. – Widjet Nov 12 '15 at 03:05
  • 50
    Attention: The hash calculated with this method changes between processes. i.e. `hash(np.array([0,1]).data.tobytes())` has a different value in different python process. It is not a deterministic value calculated from the array content. To get a deterministic behavior you need to set the `PYTHONHASHSEED` environment variable. More info: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions – Harald Thomson Apr 16 '18 at 14:03
  • 3
Using `hash(str(a))` is dangerous and leads to wrong results for long arrays, as only the **shortened** string representation of the array, namely `'[63 30 33 ..., 96 25 60]'`, is hashed (this is probably also the reason why this method is faster?). Particularly, all arrays `a` having the same starting and ending values end up with the same hash. Could you please add a short note on that in your answer? – Tobias Windisch Jan 09 '21 at 19:40
  • 1
`tostring()` is deprecated now and numpy asks us to use `tobytes()`, and `a.data` gives `ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'`. What should I do? – codeslord Oct 11 '21 at 14:09
  • 3
If you use `hashlib` you 1. will get a deterministic value, 2. don't have to set the array to read-only, and 3. the operation is not restricted to any format: `hashlib.sha256(a.data)`. – Jann Poppinga Nov 01 '21 at 07:22
  • You have to be careful with `hash(a.data.tobytes())` when the arrays are of different shape. For example, `np.zeros((2, 1))` and `np.zeros((1, 2))` have the same bytes. – Daniel S. Aug 12 '22 at 13:46
40

You can try `xxhash` via its Python binding. For large arrays this is much faster than `hash(x.tostring())`.

Example IPython session:

>>> import xxhash
>>> import numpy
>>> x = numpy.random.rand(1024 * 1024 * 16)
>>> h = xxhash.xxh64()
>>> %timeit hash(x.tostring())
1 loops, best of 3: 208 ms per loop
>>> %timeit h.update(x); h.intdigest(); h.reset()
100 loops, best of 3: 10.2 ms per loop

And by the way, on various blogs and answers posted to Stack Overflow, you'll see people using sha1 or md5 as hash functions. For performance reasons this is usually not acceptable, as those "secure" hash functions are rather slow. They're useful only if hash collision is one of the top concerns.
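
To get a feel for the gap, a comparison along these lines can be timed (an illustrative sketch; `xxh64_digest` is the one-shot helper in the `xxhash` binding, and exact timings depend on your machine and library versions):

>>> import hashlib
>>> %timeit hashlib.md5(x.tobytes()).digest()    # cryptographic: noticeably slower
>>> %timeit xxhash.xxh64_digest(x.tobytes())     # non-cryptographic: much faster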

Nevertheless, hash collisions happen all the time. And if all you need is implementing `__hash__` for data-array objects so that they can be used as keys in Python dictionaries or sets, I think it's better to concentrate on the speed of `__hash__` itself and let Python handle the hash collision[1].

[1] You may need to override `__eq__` too, to help Python manage hash collisions. You would want `__eq__` to return a boolean, rather than an array of booleans as numpy does.
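
A minimal sketch of such a wrapper (illustrative only: the class name is invented, and it assumes the one-shot `xxh64_intdigest` helper from the `xxhash` binding):

import numpy as np
import xxhash

class HashableArray:
    def __init__(self, arr):
        self.arr = np.asarray(arr)

    def __hash__(self):
        # Fast non-cryptographic hash of the raw buffer contents.
        return xxhash.xxh64_intdigest(self.arr.tobytes())

    def __eq__(self, other):
        # Return a single boolean so Python can resolve collisions,
        # rather than numpy's elementwise array of booleans.
        return isinstance(other, HashableArray) and np.array_equal(self.arr, other.arr)

cache = {HashableArray(np.arange(3)): 'cached value'}   # usage example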

Cong Ma
I think non-cryptographic hashes also try to prevent collisions for 'normal' data, right? The crypto part is that a malicious attacker cannot be more likely to find a collision or learn about the hashed object. So like this answer says, definitely don't use sha1 or md5 when performance is an issue and security is not. – Mark Nov 08 '16 at 14:01
  • The fourth line should be `h = xxhash.xxh64()` – Micah Smith Apr 12 '17 at 14:30
  • 1
    @MicahSmith Thanks. Fixed. – Cong Ma Apr 13 '17 at 03:34
  • `xxhash` from python has a limit of something like 2**32 bytes, so you may need to chunk up your array and combine the hashes using something like: https://stackoverflow.com/questions/113511/best-implementation-for-hashcode-method – nimish Feb 03 '18 at 18:26
  • 2
    Note that `xxhash` is deterministic and yields the same result for different python processes. This is not the case (by default) for `hash`, which uses a random seed for every new process. See Harald Thomson's comment or this [SO thread](https://stackoverflow.com/questions/27522626). – normanius Mar 03 '20 at 13:35
9

If your `np.array()` is small and in a tight loop, then one option is to skip `hash()` completely and just use `np.array().data.tobytes()` directly as your dict key:

import numpy as np

grid  = np.array([[True, False, True], [False, False, True]])
cache = {}

def cached(grid):
    key = grid.data.tobytes()        # the raw bytes serve directly as the key
    if key not in cache:
        cache[key] = function(grid)  # function() is the expensive call being cached
    return cache[key]
James McGuigan
  • This works great for me. Remember, a dictionary is going to hash your keys anyway, so as long as the arrays are small and memory isn't too big a concern, this is a good choice. – gregamis Mar 06 '23 at 17:54
8

Coming late to the party, but for large arrays, I think a decent way to do it is to randomly subsample the matrix and hash that sample:

import numpy as np

def subsample_hash(a):
    rng = np.random.RandomState(89)   # fixed seed so the same positions are sampled
    inds = rng.randint(low=0, high=a.size, size=1000)
    b = a.flat[inds]                  # this fancy-indexed sample is already a copy
    return hash(b.tobytes())          # tobytes() keeps this working on Python 3,
                                      # where hashing a non-byte memoryview raises

I think this is better than doing `hash(str(a))`, because the latter could confuse arrays that have unique data in the middle but zeros around the edges.
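
A quick usage sketch (illustrative; since only 1000 sampled positions feed the hash, distinct arrays can still collide, so pair this with a full equality check where correctness matters):

>>> big = np.random.rand(10**6)
>>> subsample_hash(big) == subsample_hash(big.copy())   # equal values, equal hashes
True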

hunse
  • 3
    People doing one-hot coding will be sad. – user48956 Oct 23 '19 at 02:57
  • @user48956 : Yes, when you have sparse data (including one-hot coded data), any approach that works from the dense version of the matrix will be sub-optimal (hashing the whole thing would be slow, and hashing only part could miss the non-zero data values). Better to work from a sparse representation of the data (i.e. indices and values of the non-zero elements) and hash both the indices and values. – hunse Oct 24 '19 at 17:23
2

What kind of data do you have?

  • How large are the arrays?
  • Can the same index occur several times in the array?

If your array consists only of a permutation of indices, you can use a base conversion:

(1, 0, 2) -> 1 * 3**0 + 0 * 3**1 + 2 * 3**2 = 19

and use 19 as the hash key via

import numpy as num

base_size = 3
base = base_size ** num.arange(base_size)        # [1, 3, 9]: positional weights
max_base = (base * num.arange(base_size)).sum()  # largest hash a permutation can produce

array = num.array([1, 0, 2])                     # the permutation to hash
hashed_array = (base * array).sum()              # -> 19

Now you can use an array of shape (max_base + 1,) instead of a dict in order to access the values.
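
A small usage sketch of that idea (names invented for illustration):

values = num.zeros(max_base + 1)   # flat lookup table indexed by the hash value
values[hashed_array] = 42.0        # store a result for this permutation
result = values[hashed_array]      # O(1) retrieval, no dict required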

Hensing
  • 1
    Why the list comprehension? This can be done much faster in NumPy as `base_size ** np.arange(base_size)`. – Fred Foo May 16 '13 at 15:46
  • Interesting approach, although slower for small arrays. I'll keep this in mind if I need to play with anything large :) – sapi May 16 '13 at 21:13