58

Suppose I have a namedtuple like this:

EdgeBase = namedtuple("EdgeBase", "left, right")

I want to implement a custom hash-function for this, so I create the following subclass:

class Edge(EdgeBase):
    def __hash__(self):
        return hash(self.left) * hash(self.right)

Since the object is immutable, I want the hash-value to be calculated only once, so I do this:

class Edge(EdgeBase):
    def __init__(self, left, right):
        self._hash = hash(self.left) * hash(self.right)

    def __hash__(self):
        return self._hash

This appears to be working, but I am really not sure about subclassing and initialization in Python, especially with tuples. Are there any pitfalls to this solution? Is there a recommended way how to do this? Is it fine? Thanks in advance.

Björn Pollex
  • 75,346
  • 28
  • 201
  • 283
  • 1
    Tests have shown that it's not worth to cache the hash value, see [issue #9685](https://mail.python.org/pipermail/python-bugs-list/2013-January/192363.html) – John La Rooy Nov 01 '16 at 23:25
  • 2
    Side-note: The canonical way to combine hashes is to compute the hash of a `tuple` of the combined values; `tuple` uses a more tested, reliable way of combining hashes than most naive solutions (significantly better than mere multiplication of hashes). So if you just didn't override hash, `tuple`'s default hash would already be correct/efficient (more efficient than anything you'll write because it doesn't execute any Python level byte code at all). Even if this wasn't a `namedtuple`, the correct solution would be `hash((self.left, self.right))`. – ShadowRanger Sep 06 '18 at 23:01
  • @ShadowRanger good point. This was quite a while ago, but if I remember correctly, the hash-function was chosen like this because the class represented undirected edges, so in this case it the `tuple` approach wouldn't work. – Björn Pollex Sep 10 '18 at 07:38
  • @BjörnPollex: In that case, the `hash` of a `frozenset` of the values would work. Not as good at producing unique hashes as `tuple`'s hash (`frozenset` just shuffles up the bits predictably, then xors them together), but of course, in this case, that's the goal. – ShadowRanger Sep 10 '18 at 12:09

3 Answers3

60

edit for 2017: turns out namedtuple isn't a great idea. attrs is the modern alternative.

class Edge(EdgeBase):
    def __new__(cls, left, right):
        self = super(Edge, cls).__new__(cls, left, right)
        self._hash = hash(self.left) * hash(self.right)
        return self

    def __hash__(self):
        return self._hash

__new__ is what you want to call here because tuples are immutable. Immutable objects are created in __new__ and then returned to the user, instead of being populated with data in __init__.

cls has to be passed twice to the super call on __new__ because __new__ is, for historical/odd reasons implicitly a staticmethod.

habnabit
  • 9,906
  • 3
  • 32
  • 26
  • 4
    Why should I prefer this to my solution (this is not meant to be a cynical question, I really want to understand if there are any significant differences)? – Björn Pollex Sep 02 '10 at 08:00
  • 9
    Your solution doesn't use `super`, and therefore will break in any sort of multiple inheritance situation. It doesn't matter if *you* don't use MI; if someone else does, their code will break horribly. It's not difficult to avoid this issue by just using `super` everywhere. – habnabit Sep 02 '10 at 08:06
  • 7
    It's also beneficial to add `__slots__ = ()` to the class. Otherwise `__dict__` will be created, negating memory efficiency of `namedtuple`. – WGH Sep 25 '13 at 14:14
  • Setting `__slots__` would break the assignment to _hash and clearly the class isn't readonly as that would also break the assignment to _hash regardless of whether it's done in `__new__`. – Gordon Wrigley Jan 05 '14 at 21:59
  • It is worth noting that unlike a pure `namedtuple`, the derived class will actually allow unknown parameters to be set on the instance, which may not be what you want. – ereOn May 15 '15 at 20:06
  • 1
    You can do slots = ('_hash',) instead of (). Someone assigning _hash will still break this class, but you have marked it private, so you have done your job. Also, I would use + instead of * for the hash function. – Evan Apr 11 '16 at 20:01
  • 12
    `attrs` cannot be considered "the" modern alternative unless and until it replaces `namedtuple` in the standard library. Not needing third-party dependencies is still a common design constraint. – zwol Jan 08 '18 at 23:44
  • @zwol that's usually a misguided constraint :) there's lots of tooling available these days for making third-party dependencies a non-issue, like pipenv and pex. – habnabit Jan 09 '18 at 23:00
  • 1
    @zwol also consider that even the python stdlib recognizes its own deficient state: https://docs.python.org/3/library/urllib.request.html#module-urllib.request has a link to `requests`. – habnabit Jan 09 '18 at 23:05
  • 1
    @habnabit I cannot agree with the assertion that this is a misguided constraint. I'm not interested in discussing it any further. – zwol Jan 09 '18 at 23:45
  • @zwol ok! I just want to reiterate that the third-party library situation in python has been rapidly improving. it used to be less good, so I can empathize with the thought. – habnabit Jan 11 '18 at 01:07
  • 1
    **The standard `@dataclasses.dataclass` decorator is the modern alternative** – especially when passed `frozen=True` to enforce immutability ala `tuple` and `namedtuple`. `@dataclass` is well-maintained, well-documented, behaves sensibly under edge cases, and requires *no* third-party dependencies. `attrs` is *not* the modern alternative. @zwol is correct: increasing the fragility of your stack by requiring a new dependency just for dataclass functionality that Python itself already offers out-of-the-box is absurd and dangerous. – Cecil Curry Nov 15 '22 at 06:27
5

In Python 3.7+, you can now use dataclasses to build hashable classes with ease.

Code

Assuming int types of left and right, we use the default hashing via unsafe_hash+ keyword:

import dataclasses as dc


@dc.dataclass(unsafe_hash=True)
class Edge:
    left: int
    right: int


hash(Edge(1, 2))
# 3713081631934410656

Now we can use these (mutable) hashable objects as elements in a set or (keys in a dict).

{Edge(1, 2), Edge(1, 2), Edge(2, 1), Edge(2, 3)}
# {Edge(left=1, right=2), Edge(left=2, right=1), Edge(left=2, right=3)}

Details

We can alternatively override the __hash__ function:

@dc.dataclass
class Edge:
    left: int
    right: int

    def __post_init__(self):
        # Add custom hashing function here
        self._hash = hash((self.left, self.right))         # emulates default

    def __hash__(self):
        return self._hash


hash(Edge(1, 2))
# 3713081631934410656

Expanding on @ShadowRanger's comment, the OP's custom hash function is not reliable. In particular, the attribute values can be interchanged, e.g. hash(Edge(1, 2)) == hash(Edge(2, 1)), which is likely unintended.

+Note, the name "unsafe" suggests the default hash will be used despite being a mutable object. This may be undesired, particularly within a dict expecting immutable keys. Immutable hashing can be turned on with the appropriate keywords. See also more on hashing logic in dataclasses and a related issue.

pylang
  • 40,867
  • 14
  • 129
  • 121
4

The code in the question could benefit from a super call in the __init__ in case it ever gets subclassed in a multiple inheritance situation, but otherwise is correct.

class Edge(EdgeBase):
    def __init__(self, left, right):
        super(Edge, self).__init__(left, right)
        self._hash = hash(self.left) * hash(self.right)

    def __hash__(self):
        return self._hash

While tuples are readonly only the tuple parts of their subclasses are readonly, other properties may be written as usual which is what allows the assignment to _hash regardless of whether it's done in __init__ or __new__. You can make the subclass fully readonly by setting it's __slots__ to (), which has the added benefit of saving memory, but then you wouldn't be able to assign to _hash.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
Gordon Wrigley
  • 11,015
  • 10
  • 48
  • 62