
I want to hash numpy arrays without copying the data into a bytearray first.

Specifically, I have a contiguous read-only two-dimensional int64 numpy array A with unique rows. To be concrete, let's say:

A = np.array([[1, 2], [3, 4], [5, 6]])
A.setflags(write=False)

I want to make a constant-time function that maps an arbitrary array ap that is identical in value to some row A[i] to that row's index i, e.g.

foo(np.array([1, 2])) == 0
foo(np.array([3, 4])) == 1
foo(np.array([5, 6])) == 2

The natural choice is to make a dictionary like:

lookup = {a: i for i, a in enumerate(A)}

Unfortunately numpy arrays are not hashable. There are ways to hash numpy arrays, but ideally I'd like the equality to be preserved so I can use it in a dictionary without writing manual collision detection.

The referenced article does point out that I could do:

lookup = {a.data.tobytes(): i for i, a in enumerate(A)}

def foo(ap):
    return lookup[ap.data.tobytes()]

However, the tobytes method returns a copy of the data pointed to by a.data, hence doubling the memory usage.
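To make the cost concrete (object sizes are CPython-specific, so this is just a sketch):

import sys

import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])

key = A[0].tobytes()        # copies the row's 16 bytes (int64) into a fresh bytes object
print(A[0].nbytes)          # 16
print(sys.getsizeof(key))   # the 16 bytes of payload plus the bytes-object header

Building the whole lookup therefore stores a second copy of all of A's data.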

What I'd love to do is something like:

lookup = {a.data: i for i, a in enumerate(A)}

def foo(ap):
    return lookup[ap.data]

This would ideally use a pointer to the underlying memory instead of the array object or a copy of its bytes, but since a.dtype == int, this fails with:

ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'

This is fine; we can cast it with Aview = A.view(np.byte). Now we have:

>>> Aview.flags
#   C_CONTIGUOUS : True
#   F_CONTIGUOUS : False
#   OWNDATA : False
#   WRITEABLE : False
#   ALIGNED : True
#   UPDATEIFCOPY : False

>>> Aview.data.format
# 'b'

However, when trying to hash this, it still errors with:

TypeError: unhashable type: 'numpy.ndarray'
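For reference, here is a sketch of the pattern I'm after; on Python 3, a read-only, C-contiguous memoryview with format 'b' is supposed to be hashable and to compare by content, so something like this may avoid the copy entirely (the probe array has to be made read-only as well):

import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
A.setflags(write=False)
Aview = A.view(np.byte)

# Read-only 'b'-format memoryviews hash and compare by content.
lookup = {a.data: i for i, a in enumerate(Aview)}

def foo(ap):
    apview = ap.view(np.byte)
    apview.setflags(write=False)   # hashing requires a read-only buffer
    return lookup[apview.data]

assert foo(np.array([3, 4])) == 1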

A possible solution (inspired by this) would be to define:

class _wrapper(object):
    def __init__(self, array):
        self._array = array
        # Hash the row's bytes once up front (this still makes a copy,
        # but only while constructing the wrapper).
        self._hash = hash(array.data.tobytes())

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Compare the cheap hashes first, then fall back to a full value check.
        return self._hash == other._hash and np.all(self._array == other._array)

lookup = {_wrapper(a): i for i, a in enumerate(A)}

def foo(ap):
    return lookup[_wrapper(ap)]

But this seems inelegant. Is there a way to tell Python to just interpret the memoryview as a bytearray and hash it normally, without having to make a copy to a bytestring or having numpy intercede and abort the hash?

Other things I've tried:

  1. The format of A does allow me to map each row to a distinct integer (see the sketch after this list), but for very large A the space of possible rows is larger than np.iinfo(int).max, and while I can fall back to Python's arbitrary-precision integers, that is ~100x slower than just hashing the memory.

  2. I also tried doing something like:

    Aview = A.view(np.dtype((np.void, A.shape[1] * A.itemsize))).squeeze()

    However, even though Aview.flags.writeable == False, Aview[0].flags.writeable == True. Trying hash(Aview[0]) raises TypeError: unhashable type: 'writeable void-scalar'. I'm unsure whether it's possible to mark scalars as read-only, or otherwise hash a void scalar, even though most other scalars seem hashable.
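For what it's worth, a sketch of the integer encoding from point 1 (the base is hypothetical; it must exceed the largest value a row can contain):

import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])

def row_key(row, base=7):
    # Treat the row as digits of a mixed-radix number. Once the encoded
    # value exceeds np.iinfo(int).max this relies on Python's
    # arbitrary-precision integers, which is where the ~100x slowdown comes in.
    key = 0
    for v in row:
        key = key * base + int(v)
    return key

lookup = {row_key(a): i for i, a in enumerate(A)}

def foo(ap):
    return lookup[row_key(ap)]

assert foo(np.array([5, 6])) == 2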

Erik
  • Are you trying to use whole array as key, or its elements individually? – hpaulj Jul 27 '16 at 19:33
  • @hpaulj I'm trying to hash slices of an array. I made significant updates to the question that should make what I'm after more clear. – Erik Jul 27 '16 at 22:47

1 Answer


I can't make sense of what you are trying to do.

When I create an array:

In [111]: A=np.array([1,0,1,2,3,0,2])
In [112]: A.__array_interface__
Out[112]: 
{'data': (173599672, False),
 'descr': [('', '<i4')],
 'shape': (7,),
 'strides': None,
 'typestr': '<i4',
 'version': 3}
In [113]: A.nbytes
Out[113]: 28
In [114]: id(A)
Out[114]: 2984144632

I get an ndarray object with a unique id, and attributes like shape, strides, and a data buffer. That buffer is 28 bytes starting at address 173599672.

There isn't an A[3] object inside A; I have to create it:

In [115]: x=A[3]
In [116]: type(x)
Out[116]: numpy.int32
In [117]: id(x)
Out[117]: 2984723472
In [118]: x.__array_interface__
Out[118]: 
{'__ref': array(2),
 'data': (179546048, False),
 'descr': [('', '<i4')],
 'shape': (),
 'strides': None,
 'typestr': '<i4',
 'version': 3}

This x is in many ways just a 0d, one-element array (it just displays differently). Notice that its data pointer is unrelated to that of A, so it isn't even sharing memory.

A slice, on the other hand, does share memory:

In [119]: y=A[3:4]
In [120]: y.__array_interface__
Out[120]: 
{'data': (173599684, False),   # 173599672+3*4
 'descr': [('', '<i4')],
 'shape': (1,),
 'strides': None,
 'typestr': '<i4',
 'version': 3}
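A more direct way to confirm that (np.shares_memory was added in numpy 1.11, so this is a separate sketch):

np.shares_memory(A, y)   # True: y is a view into A's buffer
np.shares_memory(A, x)   # False: x carries its own copy of the value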

====================

What exactly do you mean by mapping arbitrary A[i] to its B[i]? Are you using the value at A[i] as the key, or the location as the key? In my example, the elements of A are not unique. I can uniquely access A[0] or A[2] (by index), but in both cases I get a value of 1.

But consider this situation. There is a relatively fast way of finding a value in a 1d array - in1d.

In [121]: np.in1d(A,1)
Out[121]: array([ True, False,  True, False, False, False, False], dtype=bool)

Make the B array:

In [122]: B=np.arange(A.shape[0])

All the elements in B corresponding to a 1 value in A:

In [123]: B[np.in1d(A,1)]
Out[123]: array([0, 2])
In [124]: B[np.in1d(A,0)]   # to 0
Out[124]: array([1, 5])
In [125]: B[np.in1d(A,2)]   # to 2
Out[125]: array([3, 6])

A dictionary created from A gives the same (last) values:

In [134]: dict(zip(A,B))
Out[134]: {0: 5, 1: 2, 2: 6, 3: 4}

=====================

The paragraph about hashable in the Python docs talks about needing to have a __hash__ method.

So I checked a few objects:

In [200]: {}.__hash__    # None
In [201]: [].__hash__    # None
In [202]: ().__hash__
Out[202]: <method-wrapper '__hash__' of tuple object at 0xb729302c>

In [204]: class MyClass(object):
     ...:     pass
     ...: 
In [205]: MyClass().__hash__
Out[205]: <method-wrapper '__hash__' of MyClass object at 0xb3008c4c>

A numpy integer (an int with a np.int32 wrapper):

In [206]: x
Out[206]: 2
In [207]: x.__hash__
Out[207]: <method-wrapper '__hash__' of numpy.int32 object at 0xb1e748c0>
In [208]: x.__hash__()
Out[208]: 2

A numpy array:

In [209]: A
Out[209]: 
array(['one', 'two'], 
      dtype='<U3')
In [210]: A.__hash__    # None

In [212]: np.float(12.232).__hash__()
Out[212]: 1219767578

So at a minimum the key for a dictionary must have a method of generating a hash, a unique identifier. It may be the instance id (the default case), or maybe something derived from the values of the object (a checksum of some sort?). The dictionary maintains a table of these hashes, presumably with a pointer to both the key and the value. When I do a dictionary get, it generates the hash for the object I give it, looks that up, and returns the corresponding value - if present.

Classes that aren't hashable don't have a __hash__ method (or the method is None). They can't generate this unique id. Apparently by design an object of class np.ndarray does not have a __hash__. And playing with the writability flags does not change that.
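To see that opt-out in action on Python 3 (an illustrative session; NoHash is just a made-up class):

In [250]: class NoHash(object):
     ...:     __hash__ = None     # the same opt-out mechanism ndarray uses
     ...: 
In [251]: hash(NoHash())
TypeError: unhashable type: 'NoHash'

In [252]: np.ndarray.__hash__ is None
Out[252]: True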

The big problem with trying to hash or make a dictionary of the rows of an array is that you aren't interested in a hash of a particular array instance (the object created by a slicing view), but in a hash based on the values of that row.

So to take your 2d array:

In [236]: A
Out[236]: 
array([[1, 2],
       [3, 4],
       [5, 6]])

you want A[1,:] and np.array([3, 4]) to both generate the same __hash__() value. The same goes for A[0,:] + 2, and maybe even A.mean(axis=0) (except that's a float array).
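For example (a sketch), tuples give exactly that value-based behavior, though at the cost of copying each row's elements out into Python objects:

In [240]: lookup = {tuple(row): i for i, row in enumerate(A)}
In [241]: lookup[tuple(np.array([3, 4]))]
Out[241]: 1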

And since you are worried about memory, you must be dealing with very large arrays, say (1000, 1000), which implies a hash built from 1000 different numbers per row that is somehow unique.

hpaulj
  • Thanks for your help. I missed an important aspect: A is two-dimensional, so the things I'm indexing are slices with shared memory that are still numpy arrays. My edits to the question should make this more clear. Also, `np.in1d` is not a sufficient solution since it's `O(A.shape[0])`, and I'm looking for constant time in that dimension. If I wasn't, then `np.in1d` would work with casting each row to a void type. – Erik Jul 27 '16 at 22:52
  • This is starting to look like a `unique rows` problem. That's come up before. Another way to look at this is as a list of tuples. Tuples are hashable. Sparse `dok` format uses 2 element tuples as the key in a dictionary subclass. – hpaulj Jul 28 '16 at 01:16
  • It's certainly similar to the unique rows problem, but importantly different. Unique rows can be found pretty easily by casting to a void type and then using `np.unique`. I know these rows are unique; I want to hash them without a memory copy. In addition to `a.data.tobytes()` I could certainly do `tuple(a)`, but that's going to have a larger, not a smaller, memory overhead. – Erik Jul 28 '16 at 14:24
  • How about an array subclass with a `__hash__` method, https://docs.python.org/2/glossary.html#term-hashable – hpaulj Jul 28 '16 at 15:09
  • I'm not sure where I found it, but there are ways to do this; however, due to the way numpy subclassing and memory sharing work, it's nontrivial to implement. This is mostly an excuse, but implementing what you describe also has more functionality than I need. – Erik Jul 28 '16 at 18:28