20

I am more familiar with the "Java way" of building complex / combined hash codes from superclasses in subclasses. Is there a better / different / preferred way in Python 3? (I cannot find anything specific to Python 3 on this matter via Google.)

class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # Just a guess...
        return hash(super()) + 31 * hash(self.__more_data)

To simplify this question, please assume self.__data and self.__more_data are simple, hashable data, such as str or int.

Martijn Pieters
kevinarpe

4 Answers

31

The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.

Base requirements

First things first:

  • If two objects test as equal, then they MUST have the same hash value
  • Objects that have a hash MUST produce the same hash over the object's lifetime.

Only when you follow those two rules can your objects safely be used in dictionaries and sets. The hash must not change because dictionaries and sets use it to pick a storage location; if the hash changes, they can no longer locate the stored object when given another object that tests equal.
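To make that concrete, here is a small illustrative sketch (the Mutable class is made up for this example) of what goes wrong when an object's hash changes after it has been stored:

class Mutable:
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, Mutable) and self.value == other.value
    def __hash__(self):
        return hash(self.value)

m = Mutable(1)
d = {m: 'stored'}
m.value = 2        # the hash of m has now changed
print(m in d)      # False: the dict looks in the bucket for the new hash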

Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.
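For example, in an interactive session:

>>> hash(True) == hash(1) == hash(1.0)
True
>>> {True: 'bool', 1: 'int', 1.0: 'float'}
{True: 'float'}

(The first key object inserted is kept, and the last assigned value wins.)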

What makes a good hash value

You want to combine the components of your object's value in ways that will produce, as much as possible, different hashes for different values. That includes capturing ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes most of the time.

Note that it's fine if two objects that represent different values (that won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution, and collision resolution takes more time, so much so that servers with predictable hashing implementations can be hit with denial of service attacks (*).

So you want a nice wide spread of possible hash values.

Pitfalls to watch out for

The documentation for the object.__hash__ method includes some advice on how to combine values:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True

Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do this with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.

Next, know that while Python integers are unbounded, hash values are not; that is to say, hash values have a finite range. From the same documentation:

Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.

This means that if you use addition, multiplication, or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.

Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.
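A quick illustrative snippet (my own; the variable names are arbitrary) shows how small the combined range gets:

from functools import reduce
import random

# hash(n) == n for small non-negative integers, so XOR-ing 1000 such hashes
# can only ever produce a value between 0 and 15
values = [random.randint(0, 10) for _ in range(1000)]
combined = reduce(lambda acc, v: acc ^ hash(v), values, 0)
print(combined)   # always lands somewhere in the tiny range 0-15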

Hashing, the easy way

Python developers have long since wrestled with the above pitfalls and solved them for the standard library types. Use this to your advantage: put your values in a tuple, then hash that tuple.

Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:

def __hash__(self):
    return hash((self.a, self.b))

This way, different orderings produce different hash values, barring the vanishingly rare collision.
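Revisiting the earlier Foo example with a tuple-based hash (a collision is still possible in principle, just astronomically unlikely):

>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash((self.a, self.b))
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
False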

If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:

def __hash__(self):
    return hash((super().__hash__(), self.__more_data))

Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
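Applied to the classes from the question, a minimal sketch could look like this (names taken from the question; __eq__ is omitted for brevity, but remember that objects that test equal must also hash equal):

class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # the tuple position separates the parent's state from the subclass's state
        return hash((super().__hash__(), self.__more_data))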

If you have multiple values whose relative order doesn't matter, and you don't want to XOR them all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits of the hashes first:

# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))

As always, all values in the tuple or frozenset() must be hashable themselves.
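For example (an illustration of the pattern above; the tags list is made up), hashing an unordered multiset of values:

from collections import Counter

tags = ['red', 'blue', 'red']   # order must not matter, but duplicates should count
h = hash(frozenset(Counter(tags).items()))
assert h == hash(frozenset(Counter(['blue', 'red', 'red']).items()))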

Consider using dataclasses

For most objects you write __hash__ functions for, you actually want to be using a dataclass generated class:

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]

Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
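A quick interactive check of what that buys you:

>>> hash(Foo(42, 'spam')) == hash(Foo(42, 'spam'))
True
>>> {Foo(42, 'spam'), Foo(42, 'spam')}
{Foo(a=42, b='spam')}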


(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.

Martijn Pieters
  • Tip: convert any list attributes to tuple: `return hash((self.a, tuple(self.b_list)))` – Bob Stein Sep 15 '21 at 22:44
  • @BobStein: that applies to any mutable (unhashable) data structure, not just lists, and you may need to convert values recursively. You could use `hash((self.a, *self.b_list))` here to pull out all the values of `self.b_list` into the tuple created for hashing as an alternative to using `tuple()`. – Martijn Pieters Sep 16 '21 at 11:57
6

The Python documentation suggests that you use XOR to combine hashes:

The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.

I'd also recommend xor over addition and multiplication because of this:

Note

hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)"

This documentation is the same for Python 2.7 and Python 3.4, by the way.

A note on symmetry and XORing items with themselves.

As pointed out in the comments, XOR is symmetric, so the order of the operands doesn't matter. The XOR of two identical elements is also zero. So if that's not desired, mix in some rotations or shifts, or, even better, use the other answer's suggestion of taking the hash of a tuple of the identifying elements. If you don't want to preserve order, consider using a frozenset.
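For instance, a rough sketch (mine, not from the answer) of mixing in a left rotation so that operand order matters again:

# keep intermediate results within 64 bits, since Python ints are unbounded
MASK = (1 << 64) - 1

def rotate_left(x, r):
    x &= MASK
    return ((x << r) | (x >> (64 - r))) & MASK

def combine(h1, h2):
    # rotating one operand before XOR-ing breaks the symmetry of plain XOR
    return rotate_left(h1, 5) ^ (h2 & MASK)

print(combine(hash('guido'), hash('van')) == combine(hash('van'), hash('guido')))   # almost always False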

Thom Wiggers
  • Great answer. Thanks for the reference. Re: "truncates" -- is this due to Python's unrestricted integral values? *Integers have unlimited precision.* (Most are surprised to learn this about Python!) – kevinarpe Apr 04 '15 at 09:17
  • I am aware that integers have infinite precision. However, `hash()` does NOT have infinite precision. It is implemented to return a `Py_ssize_t` which will most likely be 8 bytes, so it will return the result of `__hash__` mod 2^64-1. – Thom Wiggers Apr 04 '15 at 09:22
  • is this a reasonable implementation: `return super().__hash__() ^ hash(self.__more_data)`? – kevinarpe Apr 04 '15 at 10:30
  • The problem with XOR for combining hashes is that it is symmetric: `a^b == b^a`. The following 3 will all produce the same hash: `'guido' ^ 'van' ^ 'rossum'` `'rossum' ^ 'van' ^ 'guido'` `'van' ^ 'guido' ^ 'rossum'`. If this is intended, fine. But more often than not, this is undesirable and can lead to unexpected side effects. Like values in dictionaries getting unintentionally overwritten, or sets "containing" objects that have not been added to them. (omitted `.__hash__()` for clarity) – 3dGrabber Jun 03 '19 at 14:09
  • This is true, but also holds for addition... You could maybe combine it with rotations. – Thom Wiggers Jun 04 '19 at 11:46
  • @3dGrabber: Sets and dictionaries do not drop values because their hashes are the same. *Hashes are not required to be unique*. Sets and dictionaries are hash tables and pick slots based on the hash modulo the hash table size, so clashes can and do happen anyway. You only lose values if you produce different hashes for objects that test as *equal*, either by having an `__eq__` implementation that doesn't use the same (subset of) state as `__hash__`, or by having altered the state used in `__eq__` and `__hash__` after storing in a set or dictionary. – Martijn Pieters Jul 06 '19 at 14:26
  • @3dGrabber: so all that using `XOR` with no other information does is reduce the number of possible hashes, increasing the probability of collisions. – Martijn Pieters Jul 06 '19 at 14:27
  • @ThomWiggers: This is what the elder of SO recommends: https://stackoverflow.com/a/1646913/141397 (c#) "_no-one really knows why it works well ..._" – 3dGrabber Jul 09 '19 at 10:12
  • @MartijnPieters you're right, it's not quite as dramatic as I wrote it in that comment. Unfortunately I can no longer edit it. – 3dGrabber Jul 09 '19 at 11:52
  • @3dGrabber: it works well because it uses prime numbers, so reducing the chances there'll be overlapping powers of 2 dramatically. Which is why xxHash uses prime numbers too (albeit bigger primes) and the Python `tuple()` type uses a simplified xxHash. – Martijn Pieters Jul 09 '19 at 17:20
  • @3dGrabber: ah, the 'nobody knows' part is about why a hash that in *theory* shouldn't do well does well anyway. That's because real-life data and what theory expected do not match, the same reason why TimSort works so well (real data is not nearly so random as theoretical datasets), and why [Python's dict implementation focuses on good collision handling over good hash distribution](https://github.com/python/cpython/blob/4016cd533d7749dcb92197067b390475f98b8e51/Objects/dictobject.c#L134-L221). – Martijn Pieters Jul 09 '19 at 17:38
3

Instead of combining multiple strings together, use a tuple, as tuples are hashable in Python.

from typing import Tuple

t: Tuple[str, str, int] = ('Field1', 'Field2', 33)
print(t.__hash__())

This also makes the code easier to read.
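For instance, such a tuple can also be used directly as a dictionary key (illustrative values):

d = {('Field1', 'Field2', 33): 'some value'}
print(d[('Field1', 'Field2', 33)])   # some value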

Tobias Ernst
  • I appreciate the simplicity of this answer, the other answers provide excellent detail, but this answer is valuable for getting to an answer quickly and needs more upvotes. – David Parks Oct 18 '21 at 17:51
  • However I think `print(t.__hash__())` would be better presented as `print(hash(t))`. – David Parks Oct 18 '21 at 17:52
-2

For anyone reading this, XORing hashes is a bad idea because it is possible for a particular sequence of duplicate hash values to XOR together and effectively remove an element from the hash set.

For instance:

(hash('asd') ^ hash('asd') ^ hash('derp')) == hash('derp')

and even:

(hash('asd') ^ hash('derp') ^ hash('asd')) == hash('derp')

So if you're using this technique to figure out whether a certain set of values is in the combined hash, where you could potentially have duplicate values added to the hash, using XOR can result in that value's removal from the set. Instead you should consider OR, which has the same property of avoiding unbounded integer growth that the previous poster mentioned, but ensures duplicates are not removed.

(hash('asd') | hash('asd') | hash('derp')) != hash('derp')

If you're looking to explore this more, you should look up Bloom filters.

  • `&` masks, so with multiple values, you keep dropping bits. You are *lowering the entropy significantly* and so are now worse off. I’m not sure why you mention bloom filters, as they are essentially just XORed hashes... – Martijn Pieters Jul 06 '19 at 14:21
  • Blooms are not XORed - for reasons that I describe above. How can you be assured that in your proposed technique (XORing hashes) that an element is definitely in the set? – Hashes are probabilistic sets Jul 29 '19 at 18:19
  • Sorry, I was indeed wrong. Bloom filter uses OR on the hash results (setting bits, not resetting). – Martijn Pieters Jul 29 '19 at 18:33