I want to hash numpy arrays without copying the data into a bytearray first.
Specifically, I have a contiguous read-only two-dimensional int64 numpy array A
with unique rows. To be concrete, let's say:
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
A.setflags(write=False)
I want to make a constant-time function that maps an arbitrary array ap, identical in value to some row A[i] of A, to its index i, e.g.:
foo(np.array([1, 2])) == 0
foo(np.array([3, 4])) == 1
foo(np.array([5, 6])) == 2
The natural choice is to make a dictionary like:
lookup = {a: i for i, a in enumerate(A)}
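Of course, that comprehension immediately raises:
TypeError: unhashable type: 'numpy.ndarray'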
Unfortunately numpy arrays are not hashable. There are ways to hash numpy arrays, but ideally I'd like the equality to be preserved so I can use it in a dictionary without writing manual collision detection.
The referenced article does point out that I could do:
lookup = {a.data.tobytes(): i for i, a in enumerate(A)}
def foo(ap):
    return lookup[ap.data.tobytes()]
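(Assuming ap has the same dtype and layout as A, this does behave as required:)
assert foo(np.array([3, 4])) == 1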
However, the tobytes method returns a copy of the data pointed to by a.data, hence doubling the memory usage.
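(A rough way to see the duplication: the byte keys alone hold exactly as many bytes as A itself.)
assert sum(len(a.tobytes()) for a in A) == A.nbytes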
What I'd love to do is something like:
lookup = {a.data: i for i, a in enumerate(A)}
def foo(ap):
    return lookup[ap.data]
This would ideally use a pointer to the underlying memory instead of the array object or a copy of its bytes, but since a.dtype == int, this fails with:
ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
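(For contrast, a read-only memoryview over plain bytes hashes fine, and its hash matches that of the equivalent bytes object, which is what motivates the byte cast below:)
m = memoryview(b'\x01\x02\x03')
assert hash(m) == hash(b'\x01\x02\x03')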
This is fine, we can cast it with Aview = A.view(np.byte), and now we have:
>>> Aview.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# OWNDATA : False
# WRITEABLE : False
# ALIGNED : True
# UPDATEIFCOPY : False
>>> Aview.data.format
# 'b'
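(So each row of Aview is just the 16 raw bytes of the corresponding row of A:)
assert Aview.shape == (3, 16)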
However, when trying to hash this, it still errors with:
TypeError: unhashable type: 'numpy.ndarray'
A possible solution (inspired by this) would be to define:
class _wrapper(object):
    def __init__(self, array):
        self._array = array
        # hash the row's bytes once and cache the result
        self._hash = hash(array.data.tobytes())
    def __hash__(self):
        return self._hash
    def __eq__(self, other):
        # cheap hash comparison first, then a full value comparison
        return self._hash == other._hash and np.all(self._array == other._array)
lookup = {_wrapper(a): i for i, a in enumerate(A)}
def foo(ap):
    return lookup[_wrapper(ap)]
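(As a sanity check, this wrapper version does give the behaviour asked for at the top:)
assert foo(np.array([1, 2])) == 0
assert foo(np.array([5, 6])) == 2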
But this seems inelegant. Is there a way to tell python to just interpret the memoryview as a bytearray and hash it normally without having to make a copy to a bytestring or having numpy intercede and abort the hash?
Other things I've tried:
The format of A does allow me to map each row into a distinct integer, but for very large A the space of possible arrays is larger than np.iinfo(int).max, and while I can use Python's arbitrary-precision integers, this is ~100x slower than just hashing the memory (a rough sketch of this packing is at the end of the post). I also tried doing something like:
Aview = A.view(np.void(A.shape[1] * A.itemsize)).squeeze()
However, even though A.flags.writeable == False, Aview[0].flags.writeable == True. When trying hash(Aview[0]), Python raises TypeError: unhashable type: 'writeable void-scalar'. I'm unsure if it's possible to mark scalars as read-only, or otherwise hash a void scalar, even though most other scalars seem hashable.
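For reference, here is a rough sketch of the integer packing mentioned under "other things I've tried" (row_key is just an illustrative name; this is the ~100x-slower fallback, not the solution I'm after):
def row_key(row):
    # pack the int64 values of one row into a single Python int,
    # masking so negative values still get a unique 64-bit pattern
    key = 0
    for v in row:
        key = (key << 64) | (int(v) & 0xFFFFFFFFFFFFFFFF)
    return key

lookup = {row_key(a): i for i, a in enumerate(A)}

def foo(ap):
    return lookup[row_key(ap)]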