13

I have trouble understanding how the hashability of numpy objects is managed.

>>> import numpy as np
>>> class Vector(np.ndarray):
...     pass
>>> nparray = np.array([0.])
>>> vector = Vector(shape=(1,), buffer=nparray)
>>> ndarray = np.ndarray(shape=(1,), buffer=nparray)
>>> nparray
array([ 0.])
>>> ndarray
array([ 0.])
>>> vector
Vector([ 0.])
>>> '__hash__' in dir(nparray)
True
>>> '__hash__' in dir(ndarray)
True
>>> '__hash__' in dir(vector)
True
>>> hash(nparray)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> hash(ndarray)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> hash(vector)
-9223372036586049780
>>> nparray.__hash__()
269709177
>>> ndarray.__hash__()
269702147
>>> vector.__hash__()
-9223372036586049780
>>> id(nparray)
4315346832
>>> id(ndarray)
4315234352
>>> id(vector)
4299616456
>>> nparray.__hash__() == id(nparray)
False
>>> ndarray.__hash__() == id(ndarray)
False
>>> vector.__hash__() == id(vector)
False
>>> hash(vector) == vector.__hash__()
True

How come

  • numpy objects define a __hash__ method but are nevertheless not hashable, and
  • a class deriving from numpy.ndarray defines __hash__ and is hashable?

Am I missing something?

I'm using Python 2.7.1 and numpy 1.6.1

Thanks for any help!

EDIT: added objects ids

EDIT2: Following deinonychusaur's comment, and trying to figure out whether hashing is based on content, I played with numpy.ndarray.dtype and found something quite strange:

>>> [Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype) for mytype in ('float', 'int', 'float128')]
[Vector([ 1.]), Vector([1]), Vector([ 1.0], dtype=float128)]
>>> [id(Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype)) for mytype in ('float', 'int', 'float128')]
[4317742576, 4317742576, 4317742576]
>>> [hash(Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype)) for mytype in ('float', 'int', 'float128')]
[269858911, 269858911, 269858911]

I'm puzzled... is there some (type-independent) caching mechanism in numpy?

marchelbling
  • This seems to show how you could get it to work; it seems to deal with the fact that the array is mutable. http://stackoverflow.com/a/5173201/1099682 – deinonychusaur Mar 20 '12 at 11:48
  • I understand that mutable objects should not be hashable. But here, my `Vector` class simply derives from `numpy.ndarray`, which is not hashable, yet the `Vector` class is, even though it's mutable. – marchelbling Mar 20 '12 at 12:46
  • It seems to me that what is hashed is the memory reference or something similar. If you just repeat `vector = Vector(shape=(1,), buffer=nparray)` and check its hash, it should have changed. – deinonychusaur Mar 20 '12 at 13:05
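
To illustrate the memory-reference idea from the last comment: in CPython, `id()` is the object's memory address, and an address can be handed out again once the previous object has been freed. That would be consistent with the identical ids and hashes seen in EDIT2 for short-lived objects, whatever their dtype. The sketch below is CPython-specific, and `Stub` is a hypothetical stand-in for the `Vector` class.

```python
class Stub(object):
    """Hypothetical stand-in for the Vector class in the question."""


# Temporaries: each Stub() is freed before the next one is created,
# so CPython often (not always) reuses the same address.
temp_ids = [id(Stub()) for _ in range(3)]

# Kept alive: the three objects coexist, so their ids must differ.
kept = [Stub() for _ in range(3)]
kept_ids = [id(obj) for obj in kept]

print(len(set(temp_ids)))  # frequently 1 on CPython, but not guaranteed
print(len(set(kept_ids)))  # always 3
```

Only the second result is guaranteed by the language: ids are unique among objects that are alive at the same time.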

2 Answers

8

I get the same results in Python 2.6.6 and numpy 1.3.0. According to the Python glossary, an object should be hashable if __hash__ is defined (and is not None), and either __eq__ or __cmp__ is defined. ndarray.__eq__ and ndarray.__hash__ are both defined and return something meaningful, so I don't see why hash should fail. After a quick google, I found this post on the python.scientific.devel mailing list, which states that arrays have never been intended to be hashable - so why ndarray.__hash__ is defined, I have no idea. Note that isinstance(nparray, collections.Hashable) returns True.
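
A side note on the `collections.Hashable` check: it only tests whether the type has a `__hash__` attribute that is not `None`, so it can report `True` for an object whose `hash()` call still fails. Below is a minimal sketch of such a mismatch in pure Python, written for Python 3 (where the ABC lives in `collections.abc`). The class `Raises` is hypothetical, and this is not claimed to be what numpy does internally.

```python
from collections.abc import Hashable


class Raises:
    """Hypothetical class: defines __hash__, but calling it always fails."""

    def __eq__(self, other):
        return self is other

    def __hash__(self):
        raise TypeError("unhashable type: 'Raises'")


obj = Raises()

# The ABC check only sees that __hash__ exists and is not None.
looks_hashable = isinstance(obj, Hashable)

# Actually calling hash() is the only reliable test.
try:
    hash(obj)
    really_hashable = True
except TypeError:
    really_hashable = False

print(looks_hashable, really_hashable)  # True False
```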

EDIT: Note that, for me, nparray.__hash__() returns the same as id(nparray), so this is just the default implementation. Maybe it was difficult or impossible to remove the implementation of __hash__ in earlier versions of Python (the __hash__ = None technique was apparently introduced in 2.6), so they used some kind of C API magic to achieve this in a way that wouldn't propagate to subclasses and wouldn't stop you from calling ndarray.__hash__ explicitly?

Things are different in Python 3.2.2 and the current numpy 2.0.0 from the repo. The __cmp__ method no longer exists, so hashability now requires __hash__ and __eq__ (see Python 3 glossary). In this version of numpy, ndarray.__hash__ is defined, but it is just None, so it cannot be called. hash(nparray) fails and isinstance(nparray, collections.Hashable) returns False, as expected. hash(vector) also fails.
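
The Python 3 behaviour described above can be mimicked without numpy. In this sketch, `Base` and `Child` are hypothetical stand-ins for `ndarray` and `Vector`, and `HashableChild` shows that a subclass has to opt back in explicitly.

```python
from collections.abc import Hashable


class Base:
    """Hypothetical stand-in for ndarray: explicitly unhashable."""
    __hash__ = None


class Child(Base):
    """Stand-in for Vector: inherits __hash__ = None from Base."""


class HashableChild(Base):
    """Opts back in by defining __hash__ (identity-based here)."""

    def __hash__(self):
        return id(self)


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


for cls in (Base, Child, HashableChild):
    instance = cls()
    print(cls.__name__, isinstance(instance, Hashable), is_hashable(instance))
```

With `__hash__ = None`, the ABC check and the actual `hash()` call agree for all three classes, which matches the "as expected" remark above.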

James
  • Thanks a lot for your answer. Regarding your edit, I cannot reproduce what you describe. I actually get 269709177 for `nparray.__hash__()` and 4315346832 for `id(nparray)`, so I'm still puzzled. I'm adding this to my post, as code is unreadable in comments. – marchelbling Mar 20 '12 at 14:00
  • Hmm... the hashes are the same as the ids for me in Python 2.6.6 and numpy 1.3.0, but not in Python 2.7.2 and numpy 1.5.1. This is all very strange. It would probably help if I didn't happen to have a different version of numpy on every version of Python I have knocking around. Anyway, as far as I can see, the default definition of `__hash__` has returned the `id` for quite some time, so I suppose they must have explicitly overridden it to do something different in at least some versions of numpy. – James Mar 20 '12 at 15:22
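
One possible explanation for the mismatch, offered as an assumption rather than something established in this thread: newer CPython versions appear to derive the default hash from the object's address by rotating it right by 4 bits, instead of returning the address unchanged. The numbers posted in the question are consistent with that. The helper `rotate_right_4` below is hypothetical and assumes a 64-bit build.

```python
def rotate_right_4(address, bits=64):
    """Rotate an unsigned `bits`-wide integer right by 4; return it as signed."""
    mask = (1 << bits) - 1
    rotated = ((address >> 4) | (address << (bits - 4))) & mask
    # Reinterpret the result as a signed integer, the way hash() reports it.
    if rotated >= 1 << (bits - 1):
        rotated -= 1 << bits
    return rotated


# (id, hash) pairs copied from the question
reported = {
    "nparray": (4315346832, 269709177),
    "ndarray": (4315234352, 269702147),
    "vector": (4299616456, -9223372036586049780),
}

for name, (obj_id, obj_hash) in reported.items():
    print(name, rotate_right_4(obj_id) == obj_hash)
```

If this reading is right, the hashes in the question are still the default identity-based hash, just not numerically equal to `id()`.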
2

This is not a definitive answer, but here are some leads to follow to understand this behavior.

I refer here to the numpy code of the 1.6.1 release.

According to the numpy.ndarray object implementation (see numpy/core/src/multiarray/arrayobject.c), the hash method is set to NULL.

NPY_NO_EXPORT PyTypeObject PyArray_Type = {
#if defined(NPY_PY3K)
    PyVarObject_HEAD_INIT(NULL, 0)
#else
    PyObject_HEAD_INIT(NULL)
    0,                                          /* ob_size */
#endif
    "numpy.ndarray",                            /* tp_name */
    sizeof(PyArrayObject),                      /* tp_basicsize */
    &array_as_mapping,                          /* tp_as_mapping */
    (hashfunc)0,                                /* tp_hash */

This tp_hash property seems to be overridden in numpy/core/src/multiarray/multiarraymodule.c. See DUAL_INHERIT, DUAL_INHERIT2 and the initmultiarray function, where the tp_hash attribute is modified.

Ex: PyArrayDescr_Type.tp_hash = PyArray_DescrHash

According to hashdescr.c, the hash is implemented as follows:

* How does this work ? The hash is computed from a list which contains all the
* information specific to a type. The hard work is to build the list
* (_array_descr_walk). The list is built as follows:
*      * If the dtype is builtin (no fields, no subarray), then the list
*      contains 6 items which uniquely define one dtype (_array_descr_builtin)
*      * If the dtype is a compound array, one walk on each field. For each
*      field, we append title, names, offset to the final list used for
*      hashing, and then append the list recursively built for each
*      corresponding dtype (_array_descr_walk_fields)
*      * If the dtype is a subarray, one adds the shape tuple to the list, and
*      then append the list recursively built for each corresponding type
*      (_array_descr_walk_subarray)
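
The scheme in that comment could be sketched in Python roughly as follows. The dictionary layout, the field order and the six placeholder items are assumptions made for illustration; they are not numpy's actual data structures.

```python
def walk(descr, items):
    """Append hashable items describing `descr` (a hypothetical dict) to `items`."""
    if "fields" in descr:
        # Compound dtype: title, name, offset, then the field's own description.
        for title, name, offset, sub in descr["fields"]:
            items.extend([title, name, offset])
            walk(sub, items)
    elif "subarray" in descr:
        # Subarray dtype: the shape tuple, then the base type's description.
        shape, base = descr["subarray"]
        items.append(shape)
        walk(base, items)
    else:
        # Builtin dtype: six items assumed to identify it uniquely.
        items.extend(descr["builtin"])


def descr_hash(descr):
    """Hash the flat list built by walk(), as the comment describes."""
    items = []
    walk(descr, items)
    return hash(tuple(items))


# Placeholder values only; the real six items are not listed in the excerpt.
float64 = {"builtin": ("f", "=", 0, 12, 8, 8)}
point = {"fields": [(None, "x", 0, float64), (None, "y", 8, float64)]}
grid = {"subarray": ((2, 3), point)}

# Two structurally identical descriptions produce the same hash.
print(descr_hash(grid) == descr_hash({"subarray": ((2, 3), point)}))  # True
```

The point of the design is that the hash depends only on the description of the type, not on the identity of any particular object.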
ohe