4

I noticed that when I use user-defined objects (that override the __hash__ method) as dict keys in Python, lookup time increases by at least a factor of 5.

This behaviour is observed even when I use very basic hash methods such as in the following example:

class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return (self.a == other.a and
                self.__class__ == other.__class__)

# get an instance of class A
mya = A(42)
# define dict
d1 = {mya: [1, 2], 'foo': [3, 4]}

If I time the access through the two different keys, I observe a significant difference in performance:

%timeit d1['foo']

results in ~100 ns, whereas

%timeit d1[mya]

results in ~600 ns.

If I remove the overridden __hash__ and __eq__ methods, performance is back at the same level as for a default object.

Is there a way to avoid this loss in performance while still implementing a customised hash calculation?
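For readers outside IPython, the same measurement can be reproduced with the stdlib `timeit` module; this is a sketch using the class from the question, and the absolute numbers will of course vary by machine:

```python
import timeit

class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)
    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return (self.a == other.a and
                self.__class__ == other.__class__)

mya = A(42)
d1 = {mya: [1, 2], 'foo': [3, 4]}

# Time both lookups; the key with the Python-level __hash__ is
# consistently slower than the builtin str key.
t_str = timeit.timeit("d1['foo']", globals={'d1': d1}, number=100_000)
t_obj = timeit.timeit("d1[mya]", globals={'d1': d1, 'mya': mya},
                      number=100_000)
print(f"str key: {t_str:.4f}s, custom key: {t_obj:.4f}s")
```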

  • Not easily. Your `__hash__` and `__eq__` are Python functions; the corresponding methods of builtin types are written in C and are therefore faster. – L3viathan Sep 01 '18 at 12:05

1 Answer

8

The default CPython __hash__ implementation for a custom class is written in C and is derived from the memory address of the object. It therefore does not have to access anything from the object at all and can be computed very quickly; it amounts to little more than a single integer operation on the CPU.

The "very basic" __hash__ from the example is not as simple as it may seem:

def __hash__(self):
    return hash(self.a)

This has to read the attribute a of self, which in this case calls object.__getattribute__(self, 'a'), and that looks up the value of 'a' in the instance's __dict__. That lookup already involves computing hash('a'). Then the returned value is passed to hash().
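The layers described above can be made visible directly; this is a sketch showing that the attribute read in __hash__ goes through the instance __dict__, and that the final result is just hash() of the stored value:

```python
class A:
    def __init__(self, a):
        self.a = a
    def __hash__(self):
        return hash(self.a)

mya = A(42)

# The attribute lookup inside __hash__ is equivalent to going
# through object.__getattribute__, which reads the instance __dict__.
assert object.__getattribute__(mya, 'a') == mya.__dict__['a'] == 42

# The final result is simply hash() of the attribute's value.
assert hash(mya) == hash(42)
```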


To answer the additional question:

Is there a way to implement a faster __hash__ method that returns predictable values, I mean that are not randomly computed at each run as in the case of the memory address of the object?

Anything that accesses attributes of objects will be slower than the default implementation, which does not need to access attributes at all. You could, however, make attribute access faster by using __slots__, or by implementing a highly optimized C extension for the class.

There is, however, another question: is this really a problem? I cannot really believe that an application becomes slow because of a slow __hash__. __hash__ should still be pretty fast unless the dictionary has trillions of entries, but then everything else would become slow as well and call for bigger changes...


I did some testing and have to make a correction: using __slots__ is not going to help in this case at all. My tests showed that in CPython 3.7 the class above actually becomes slightly slower when using __slots__.
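For reference, the __slots__ variant discussed above looks like this; as the correction notes, it does not speed up __hash__ in this case, though it does remove the per-instance __dict__:

```python
class ASlots:
    # __slots__ replaces the per-instance __dict__ with fixed
    # descriptor-based storage for the named attributes.
    __slots__ = ('a',)

    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        if not isinstance(other, ASlots):
            return NotImplemented
        return self.a == other.a

s = ASlots(42)
assert hash(s) == hash(42)
assert not hasattr(s, '__dict__')  # no per-instance dict
```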

zvone
  • Is there a way to implement a faster `__hash__` method that returns predictable values, I mean that are not randomly computed at each run as in the case of the memory address of the object? – Paolo Gervasoni Vila Sep 01 '18 at 12:39
  • @PaoloGervasoniVila See addition to my answer. – zvone Sep 01 '18 at 12:47
  • 1
    @PaoloGervasoniVila: Compute the hash during `__init__` and store it as an attribute, so the implementation become `def __hash__(self): return self.precomphash`. Still going to be slower than hashing Python built-ins (and use more memory than before), but that's just how Python works; built-ins (and C extensions) are faster, period. – ShadowRanger Sep 01 '18 at 12:53
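ShadowRanger's suggestion can be sketched as follows; note that precomputing the hash is only safe if the attribute used for hashing never changes after construction:

```python
class A:
    def __init__(self, a):
        self.a = a
        # Precompute the hash once; __hash__ is then a single
        # attribute read instead of a lookup plus a hash() call.
        # Only valid because self.a is never mutated afterwards.
        self._precomphash = hash(a)

    def __hash__(self):
        return self._precomphash

    def __eq__(self, other):
        if not isinstance(other, A):
            return NotImplemented
        return (self.a == other.a and
                self.__class__ == other.__class__)

mya = A(42)
d = {mya: 'value'}
# An equal instance hashes identically and finds the same entry.
assert d[A(42)] == 'value'
```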