As noted by many, Python's hash is no longer consistent across runs (as of version 3.3), as a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).
However, I have noticed that the hashes of some objects are still consistent (as of Python 3.7, anyway): these include int, float, tuple(x) and frozenset(x) (as long as x itself yields consistent hashes). For example:
assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427
Is that always true and guaranteed? If so, is that expected to stay that way? Is PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
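To illustrate what I am observing, here is a minimal check (CPython-specific, and the conclusion in the final comment is just what I see on my machine, not a guarantee): run the same hash() calls in two fresh interpreter processes and compare the output.

import subprocess
import sys

# Run the identical hash() calls in two separate interpreter processes.
code = "print(hash(10), hash((10, 11)), hash(frozenset([1, 2])), hash('hello'))"
outputs = [
    subprocess.run([sys.executable, "-c", code],
                   capture_output=True, text=True).stdout
    for _ in range(2)
]
print(outputs[0])
print(outputs[1])
# On my CPython 3.7 runs, the int/tuple/frozenset hashes match between the
# two processes, while the str hash differs (randomized by PYTHONHASHSEED).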
Why am I asking?
I have a system that relies on hashing to remember whether or not we have seen a given dict (in any order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples a subset of os.stat_result, e.g. (size, mtime), associated with them. This system is used to make update/sync decisions based on detecting differences.
In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.
I can tolerate the small false positive rate (< 10^-19 for 64-bit hashes) coming from possible hash collisions (see also birthday paradox).
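For reference, the back-of-the-envelope arithmetic behind that figure (my reading: the quoted rate is the chance that two specific dicts collide under a uniform 64-bit hash; a birthday-style bound then estimates the chance of any collision among the ~100K dicts):

n = 100_000                        # number of dicts in the cache
p_pair = 2 ** -64                  # two specific dicts colliding under a uniform 64-bit hash
p_any = n * (n - 1) / 2 * p_pair   # birthday-style bound over all pairs
print(f"{p_pair:.1e}")             # ~5.4e-20, i.e. < 10^-19
print(f"{p_any:.1e}")              # ~2.7e-10 for the whole cache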
One compact representation is the following for each such dict "fsd":

def fsd_hash(fsd: dict):
    return hash(frozenset(fsd.items()))
It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
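For example (filenames and stat values made up; the point is only the order-invariance and the change detection):

fsd_a = {'a.txt': (1024, 1546300800), 'b.txt': (2048, 1546387200)}
fsd_b = {'b.txt': (2048, 1546387200), 'a.txt': (1024, 1546300800)}  # same items, other insertion order
assert fsd_hash(fsd_a) == fsd_hash(fsd_b)   # order does not matter
fsd_c = {'a.txt': (1024, 1546300800), 'b.txt': (2048, 1546387201)}  # one mtime changed
assert fsd_hash(fsd_a) != fsd_hash(fsd_c)   # almost certainly different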
Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.

I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
Things I have tried
- I may use hashlib.sha1 or similar to calculate consistent hashes. This is slower, and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, which is slow) while updating the hasher. In my tests on real data, I see over a 50x slow-down (see the first sketch after this list).
- I could try applying an order-invariant hashing algorithm on consistent hashes obtained for each item (also slow, as starting a fresh hasher for each item is time-consuming; see the second sketch after this list).
- I can try transforming everything into ints or tuples of ints, and then frozensets of such tuples. At the moment, it seems that int, tuple(int) and frozenset(tuple(int)) all yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
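For concreteness, here is roughly what the first two attempts look like. This is only a sketch: fsd_digest and fsd_hash_orderfree are names I made up, fsd_digest assumes str keys and tuples of (64-bit) ints as in the (size, mtime) example, and fsd_hash_orderfree combines per-item SHA-1 digests with addition mod 2**64 to get order-invariance.

import hashlib
import struct

def fsd_digest(fsd: dict) -> bytes:
    # Attempt 1: a single hasher, fed the items in sorted-key order.
    h = hashlib.sha1()
    for key in sorted(fsd):
        h.update(key.encode('utf-8'))
        for n in fsd[key]:
            h.update(struct.pack('<q', n))  # assumes the tuple holds 64-bit ints
    return h.digest()

def fsd_hash_orderfree(fsd: dict) -> int:
    # Attempt 2: a fresh hasher per item, combined order-invariantly,
    # so no sorting is needed (but starting each hasher costs time).
    total = 0
    for item in fsd.items():
        digest = hashlib.sha1(repr(item).encode('utf-8')).digest()
        total = (total + int.from_bytes(digest[:8], 'little')) % 2 ** 64
    return total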
Additional question: more generally, what would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash, for example. One thing I came up with is:
def const_hash(x) -> int:
    if isinstance(x, (int, float, bool)):
        # Scalars already hash consistently across runs (observed on CPython).
        pass
    elif isinstance(x, frozenset):
        x = frozenset([const_hash(v) for v in x])
    elif isinstance(x, str):
        # Reduce the string to a tuple of ints, which hash consistently.
        x = tuple([ord(e) for e in x])
    elif isinstance(x, bytes):
        x = tuple(x)
    elif isinstance(x, dict):
        # frozenset (not tuple) so that the dict's insertion order does not matter.
        x = frozenset([(const_hash(k), const_hash(v)) for k, v in x.items()])
    elif isinstance(x, (list, tuple)):
        x = tuple([const_hash(e) for e in x])
    else:
        try:
            return x.const_hash()
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}')
    return hash(x)
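A quick sanity check with made-up values (with the usual caveat that the inequality is only astronomically likely, not guaranteed):

state = {'a.txt': (1024, 1546300800), 'sub': {'b.txt': (2048, 1546387200)}}
same_state = {'sub': {'b.txt': (2048, 1546387200)}, 'a.txt': (1024, 1546300800)}
changed = {'a.txt': (1024, 1546300801), 'sub': {'b.txt': (2048, 1546387200)}}

assert const_hash(state) == const_hash(same_state)  # key order does not matter
assert const_hash(state) != const_hash(changed)     # almost certainly different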