2

Is the hash() function in python guaranteed to always be the same for a given input, regardless of when/where it's entered? So far -- from trial-and-error only -- the answer seems to be yes, but it would be nice to understand the internals of how this works. For example, in a test:

$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01") 
>>> hash(d)
5440882306090652359
>>> ^D
$ python
>>> from ingest.tpr import *
>>> d=DailyPriceObj(date="2014-01-01") 
>>> hash(d)
5440882306090652359
David542
  • 104,438
  • 178
  • 489
  • 842
  • 1
    I don't think it would be useful as a hash function if it weren't. – Fred Larson Oct 10 '19 at 20:11
  • 1
    The only reference to `ingest.tpr` I can find is [your question from yesterday](https://stackoverflow.com/questions/58310753/default-str-method-for-an-instance), so I assume this is an internal package? If so we can't tell you; look at how [`__hash__`](https://docs.python.org/3/reference/datamodel.html#object.__hash__) is implemented. – jonrsharpe Oct 10 '19 at 20:17
  • 2
    Ah, from the accepted answer to that question, "Note that the hash of a value only needs to be the same for one run of Python. In Python 3.3 they will in fact change for every new run of Python" – Fred Larson Oct 10 '19 at 20:17
  • 1
    Possible duplicate of [What does hash do in python?](https://stackoverflow.com/questions/17585730/what-does-hash-do-in-python) – Fred Larson Oct 10 '19 at 20:18

3 Answers3

7

The contract for the __hash__ method requires that it be consistent within a given run of Python. There is no guarantee that it be consistent across different runs of Python, and in fact, for the built-in str, bytes-like types, and datetime.datetime objects (possibly others), the hash is salted with a per-run value so that it's almost never the same for the same input in different runs of Python.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
  • any idea of why they decided to do it like that so the hashes are sometimes different? – David542 Oct 10 '19 at 20:23
  • 2
    @David542: To make it harder for people to create a DoS attack on a Python interpreter by deliberately creating a bunch of strings with hash collisions. – dan04 Oct 10 '19 at 20:27
  • @David542 that's covered in the `__hash__` docs I linked above, which are also linked from the `hash` docs. – jonrsharpe Oct 10 '19 at 20:28
  • In addition, the hash() method will truncate the hash depending on the architecture. – Simon Oct 10 '19 at 20:41
1

No, it's dependent on the process. If you need a persistent hash, see Persistent Hashing of Strings in Python.

Truncation depending on platform, from the documentation of __hash__:

hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.

Salted hashes, from the same documentation (ShadowRanger's answer):

By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python. This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Simon
  • 774
  • 4
  • 21
1

A necessary condition for hashability is that for equivalent objects this value is always the same (inside one run of the interpreter).

Of course, nothing prevents you from ignoring this requirement. But, if you suddently want to store your objects in a dictionary or a set, then problems may arise.

When you implement your own class you can define methods __eq__ and __hash__. I have used a polynominal hash function for strings and a hash function from a universal family of hash functions.

In general, the hash values for a specific object shouldn't change from start to start interpreter. But for many data types this is true. One of the reasons for this implementation is that it is more difficult to find and anti-hash test.

For example:

For numeric types, the hash of a number x is based on the reduction of x modulo the prime P = 2**_PyHASH_BITS - 1. It's designed so that hash(x) == hash(y) whenever x and y are numerically equal, even if x and y have different types.

a = 123456789
hash(a) == 123456789
hash(a + b * (2 ** 61 - 1)) == 123456789