
As many have noted, Python's hash is no longer consistent across runs (as of version 3.3), since a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).

However, I have noticed that the hashes of some objects are still consistent across runs (as of Python 3.7, anyway): these include int, float, tuple(x), and frozenset(x) (as long as x itself yields a consistent hash). For example:

assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427

Is that always true and guaranteed? If so, is that expected to stay that way? Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
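For illustration, two separate interpreter invocations give different hashes for a str but identical hashes for an int (the str hash values shown below are made up; yours will differ):

$ python -c "print(hash('a'), hash(10))"
-7583489610679606711 10
$ python -c "print(hash('a'), hash(10))"
6214547101289334054 10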

Why am I asking?

I have a system that relies on hashing to remember whether or not we have seen a given dict (regardless of its key order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples are subsets of the fields of os.stat_result, e.g. (size, mtime), associated with them. This system is used to make update/sync decisions based on detecting differences.

In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.

I can tolerate the small false-positive rate (< 10^-19 per pair of 64-bit hashes) coming from possible hash collisions (see also the birthday paradox).
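A quick back-of-the-envelope check (the 100K figure is from above):

n = 100_000
p_pair = 2.0 ** -64               # probability that one given pair collides: ~5.4e-20
p_any = n * (n - 1) / 2 * p_pair  # probability of any collision among n hashes: ~2.7e-10
print(p_pair, p_any)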

One compact representation is the following for each such dict "fsd":

def fsd_hash(fsd: dict):
    # frozenset ignores item order, so equal dicts hash equal regardless of insertion order
    return hash(frozenset(fsd.items()))

It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
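For example (the filenames and stat values are made up):

a = {'f1.txt': (1024, 1602630000), 'f2.txt': (2048, 1602640000)}
b = {'f2.txt': (2048, 1602640000), 'f1.txt': (1024, 1602630000)}  # same items, different order
assert fsd_hash(a) == fsd_hash(b)  # order-invariant (within one Python run)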

Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.

I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
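(At best, I can detect whether hash randomization is enabled; this only inspects the startup setting, it cannot change it:)

import sys
print(sys.flags.hash_randomization)  # 0 when PYTHONHASHSEED=0 was set at startup, else 1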

Things I have tried

  1. I may use hashlib.sha1 or similar to calculate consistent hashes. This is slower, and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, which is slow) while updating the hasher (see the sketch after this list). In my tests on real data, I see over a 50x slow-down.

  2. I could try applying an order-invariant hashing algorithm to consistent hashes obtained for each item (also sketched below; this too is slow, as starting a fresh hasher for each item is time-consuming).

  3. I can try transforming everything into ints or tuples of ints, and then into frozensets of such tuples. At the moment, it seems that int, tuple(int) and frozenset(tuple(int)) all yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
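A rough sketch of approaches 1 and 2 (my own strawman code, not what I hope to ship; it assumes every key and value has a deterministic repr, which holds for str, int and tuples of them):

import hashlib

def sha1_sorted(fsd: dict) -> bytes:
    # Approach 1: a single hasher, fed the items in sorted key order
    # (the sort is what makes this slow on large dicts).
    h = hashlib.sha1()
    for key in sorted(fsd):
        h.update(repr((key, fsd[key])).encode())
    return h.digest()

def sha1_xor(fsd: dict) -> int:
    # Approach 2: order-invariant XOR of per-item digests; creating a
    # fresh hasher per item is what makes this one slow.
    acc = 0
    for item in fsd.items():
        acc ^= int.from_bytes(hashlib.sha1(repr(item).encode()).digest(), 'big')
    return acc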

Additional question: more generally, what would be a good way to write a consistent-hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash, for example. One thing I came up with is:

def const_hash(x):
    if isinstance(x, (int, float, bool)):
        pass  # numeric hashes are already consistent across runs
    elif isinstance(x, frozenset):
        x = frozenset(const_hash(v) for v in x)
    elif isinstance(x, str):
        x = tuple(ord(c) for c in x)  # reduce to ints, whose hashes are not salted
    elif isinstance(x, bytes):
        x = tuple(x)  # a bytes object iterates as ints
    elif isinstance(x, dict):
        # note: a tuple of items is insertion-order-sensitive; a frozenset of
        # (key_hash, value_hash) pairs would make this order-invariant instead
        x = tuple((const_hash(k), const_hash(v)) for k, v in x.items())
    elif isinstance(x, (list, tuple)):
        x = tuple(const_hash(e) for e in x)
    else:
        try:
            return x.const_hash()  # classes I own can implement this
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}') from None
    return hash(x)
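Used like hash itself; for example:

d = {'f1.txt': (1024, 1602630000), 'flags': frozenset({(0, 1), (2, 3)})}
print(const_hash(d))  # same value on every run (on the same interpreter/build)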
Pierre D
  • Why do you need the hash to be consistent across runs? This smells of [an XY problem](https://meta.stackexchange.com/q/66377/322040). `dict`s aren't even hashable, so "consistent hashes of dicts" is nonsensical, and any code relying on hashes to be consistent between runs misunderstands the purpose of hashes in the first place. It also makes me suspicious you're expecting hashes to be unique, which is not a guarantee at all. – ShadowRanger Oct 13 '20 at 23:40
  • I'm slightly confused. Is your program writing data to an external database or reading data from an external data base? Are you being called from other programs? Why do you need consistency across runs? – Frank Yellin Oct 14 '20 at 02:26
  • @ShadowRanger and @FrankYellin: I rewrote the question for clarity. Also, of course you can calculate a hash for a dict: order-invariant: `hash(frozenset(d.items()))`, order-sensitive: `hash(tuple(d.items()))`. – Pierre D Oct 14 '20 at 13:55
  • You should use a standard serialisation of your data, and use a hash from hashlib on that instead. Pickle serialisation should suffice for that. – jsbueno Oct 14 '20 at 15:42
  • @jsbueno: I don't think that'll work; he wants order-insensitive hashing for `frozenset` and `dict`, but the result of pickling them will be order-sensitive (the items must be written out in a specific order that may differ even when the collections are equal to each other, due to different order of construction). – ShadowRanger Oct 14 '20 at 15:44
  • Related (just putting the link here so that StackOverflow links the question, because my initial Google search found the other question first), `hash` is non-persistent: https://stackoverflow.com/questions/58330403/is-pythons-hash-persistent – Albert Sep 19 '22 at 07:33
  • Related, `hash` is non-persistent: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions – Albert Sep 19 '22 at 07:34
  • Related, persistent hash from tuple: https://stackoverflow.com/questions/53939754/python-how-can-i-have-a-persistent-hash-from-a-tuple – Albert Sep 19 '22 at 07:35
  • Related, persistent hash from nested frozenset: https://stackoverflow.com/questions/8639786/persistent-hashing-of-python-frozen-sets – Albert Sep 19 '22 at 07:37
  • Related, persistent hash for arbitrary objects: https://stackoverflow.com/questions/38009699/alternative-to-python-hash-function-for-arbitrary-objects – Albert Sep 19 '22 at 09:12

1 Answer


Short answer to the broad question: there are no explicit guarantees about hashing stability beyond the overall guarantee that x == y requires hash(x) == hash(y). The implication is that x and y are both defined in the same run of the program (you can't perform x == y where one of them doesn't exist in that program, obviously, so no guarantees are needed about the hash across runs).

Longer answers to specific questions:

Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?

It's true of numeric types, with the mechanism being officially documented, but the mechanism is only guaranteed for a particular interpreter and a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64-bit build vs. 32-bit build, even 3.n vs. 3.n+1) they can differ (they are documented to differ between 64-bit and 32-bit CPython), so the hashes won't be portable across machines with different interpreters.
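For example, on a typical 64-bit CPython build:

import sys
print(sys.hash_info.width)    # 64
print(sys.hash_info.modulus)  # 2305843009213693951, i.e. the Mersenne prime 2**61 - 1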

No guarantees on algorithm are made for tuple and frozenset. I can't think of any reason they'd change it between runs (if the underlying types are seeded, tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 the tuple hash was changed to reduce the number of collisions for short tuples of ints and floats), so if you store the hashes of tuples from, say, 3.7 and then compute hashes of the same tuples on 3.8+, they won't match (even though they'd match between runs on 3.7, or between runs on 3.8).

If so, is that expected to stay that way?

Expected to, yes. Guaranteed, no. I could easily see them seeding the hashes of ints (and by extension of all numeric types, to preserve the numeric hash/equality guarantees) for the same reason they seeded the hashes of str/bytes, etc. The main hurdles would be:

  1. It would almost certainly be slower than the current, very simple algorithm.
  2. By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
  3. It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).

Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?

Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
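For example, within a single run:

m = memoryview(b'abc')          # read-only view over a bytes object
assert hash(m) == hash(b'abc')  # same as hashing m.tobytes()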

What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?

The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function (functools.singledispatch), using it the same way you use hash itself. This avoids having one single function, defined in a single place, that must handle all types: you keep the const_hash default implementation (which just relies on hash for the things with known consistent hashes) in a central location, and provide additional definitions there for the built-in types you know aren't consistent (or which might contain inconsistent stuff), while still allowing people to extend the set of things it covers seamlessly, by importing your const_hash and decorating the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable. See the sketch below.
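A minimal sketch of that design (my illustration, mirroring the reductions in the const_hash from the question):

from functools import singledispatch

@singledispatch
def const_hash(x) -> int:
    # Default: defer to a user-supplied method, like the question's fallback.
    try:
        return x.const_hash()
    except AttributeError:
        raise TypeError(f'no known const_hash implementation for {type(x)}') from None

@const_hash.register(int)    # bool is a subclass of int, so it's covered too
@const_hash.register(float)
def _(x) -> int:
    return hash(x)  # numeric hashes are consistent per interpreter/build

@const_hash.register(str)
def _(x) -> int:
    return hash(tuple(ord(c) for c in x))  # reduce to ints before hashing

@const_hash.register(bytes)
def _(x) -> int:
    return hash(tuple(x))

@const_hash.register(frozenset)
def _(x) -> int:
    return hash(frozenset(const_hash(v) for v in x))

@const_hash.register(dict)
def _(x) -> int:
    # frozenset of item hashes: order-invariant, like hash(frozenset(d.items()))
    return hash(frozenset((const_hash(k), const_hash(v)) for k, v in x.items()))

@const_hash.register(tuple)
@const_hash.register(list)
def _(x) -> int:
    return hash(tuple(const_hash(e) for e in x))

Callers can then add support for their own types with @const_hash.register, without modifying the central module.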

ShadowRanger
  • fantastic response, and the last paragraph in particular is exactly what I was looking for. I didn't know about single dispatch function. That makes for a much nicer design indeed: composable and extensible. – Pierre D Oct 14 '20 at 22:54