Yes, you read the title correctly. I'm trying to figure out why Python's built-in hash() function would return a different value for the same input.

This is the code that computes the hash:

    # element is an instance of a typing.NamedTuple
    def compute_hash(self, element):
        values = self.get_key_values_tuple(element)
        _hash = hash(values)
        logging.info(f'Hash of {values} is {_hash}')
        return _hash

    # self.keys in this instance is ['session_id', 'time', 'wearable_id']
    def get_key_values_tuple(self, element: tuple) -> tuple:
        return tuple(getattr(element, key) for key in self.keys)

This code generates these results: [image: hash results]

and these logs: [image: code logs]

Keep in mind that this code works perfectly on other datasets with the same input data types. It also works intermittently on this dataset: sometimes the hash is the same for the same input triplet, and sometimes it's different.

A bit more context:

Using Python 3.8.

I'm building Apache Beam components that run on GCP Dataflow. This means that the code can be executed on different machines, but the VM/container in which it's executed is always the same (i.e. the exact same environment).
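One way to check whether the process boundary matters here, independent of Dataflow, is to hash the same triplet in separate interpreter processes. The sketch below is only an illustration: the key values are made up, and `hash_in_new_process` is a hypothetical helper. It relies on the documented `PYTHONHASHSEED` environment variable, which controls the seed used for hashing `str` and `bytes` objects.

```python
import os
import subprocess
import sys

# Hypothetical key triplet standing in for (session_id, time, wearable_id).
SNIPPET = "print(hash(('session-1', 1634163300, 'wearable-7')))"


def hash_in_new_process(seed: str) -> int:
    """Run SNIPPET in a fresh interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Same seed in two processes: the two values match.
    print(hash_in_new_process("1"), hash_in_new_process("1"))
    # Different seeds: the values are expected to differ.
    print(hash_in_new_process("1"), hash_in_new_process("2"))
```

If the values only agree when the seed is pinned, then workers that start with different (or random) seeds would produce different hashes for identical triplets.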

Simon Corcos
  • "This means that the code can be executed on different machines" - `hash` values are not intended to be consistent across different Python processes, let alone on different machines. – user2357112 Oct 13 '21 at 22:15
  • Really? Do you have documentation on that? I've been looking for a reason. Feel free to answer the question. – Simon Corcos Oct 13 '21 at 22:16
  • Is this the same as https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions ? – Matt Cliff Oct 13 '21 at 22:18
  • `hash` is meant for hash-based containers, e.g. dict and set. It isn't guaranteed to be unique across Python processes, and indeed, is often purposefully randomized – juanpa.arrivillaga Oct 13 '21 at 22:24
  • Well, now I know. Thanks, guys – Simon Corcos Oct 13 '21 at 22:26
  • `hash()` is made to be fast (it is used a lot within Python code). Using `hashlib` solves these problems, including the DoS vulnerability of `hash()`, but using a strong hash function in place of `hash()` would slow Python down too much. – Giacomo Catenazzi Oct 14 '21 at 07:49
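
Following the `hashlib` suggestion in the comments, a process-independent replacement for `compute_hash` could look roughly like the sketch below. This is an assumption-laden illustration, not the asker's code: `Reading` and `stable_hash` are hypothetical names, and serializing with `repr()` assumes the key values are plain types such as `str` and `int` whose representation is stable across processes.

```python
import hashlib
from typing import NamedTuple, Sequence


class Reading(NamedTuple):
    """Hypothetical element type with the three key fields from the question."""
    session_id: str
    time: int
    wearable_id: str


def stable_hash(element: tuple, keys: Sequence[str]) -> int:
    """Return a 64-bit integer that depends only on the key values."""
    values = tuple(getattr(element, key) for key in keys)
    # Assumption: repr() of the values is deterministic (true for str and int).
    encoded = repr(values).encode("utf-8")
    digest = hashlib.sha256(encoded).digest()
    # Keep the first 8 bytes so the result fits in an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big")


if __name__ == "__main__":
    keys = ["session_id", "time", "wearable_id"]
    element = Reading("session-1", 1634163300, "wearable-7")
    print(stable_hash(element, keys))
```

Unlike `hash()`, the SHA-256 digest does not depend on a per-process seed, so the same triplet should map to the same integer on every worker. The trade-off is speed, as noted in the comment above.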
