2

I am trying to understand python set hashing in detail. I have the following script:

import hashlib

ss = {'a', 'b', 'c', 'd', 'e', 'f', 'g'}
print(ss)
print(hashlib.md5(str(ss).encode()).hexdigest())

ss = {'a', 'b', 'c', 'd', 'e', 'f', 'g'}
print(ss)
print(hashlib.md5(str(ss).encode()).hexdigest())

I see that the resulting set ss is same within a script run, however, it's different across different script runs.

Output of 1st run:

{'e', 'a', 'd', 'b', 'g', 'f', 'c'}
71ac8e6e505b445790b9f943d67ae8e9
{'e', 'a', 'd', 'b', 'g', 'f', 'c'}
71ac8e6e505b445790b9f943d67ae8e9

Output of 2nd run:

{'b', 'e', 'g', 'd', 'f', 'c', 'a'}
274892f97127704494b6b80face616fd
{'b', 'e', 'g', 'd', 'f', 'c', 'a'}
274892f97127704494b6b80face616fd

I am trying to understand, why the outputs are same within a run, but different across runs?

Thanks!

Potato_head
  • 336
  • 1
  • 4
  • 15
  • This is not a duplicate of https://stackoverflow.com/q/27522626/, which is about the builtin ```hash``` function. This question has to do with ```hashlib```, which it's reasonable to think would have a stable hash implementation. And it does—but turning ```set```s into ```str```s to pass to ```hashlib```'s functions is _not_ stable across executions. – Attila the Fun Aug 16 '23 at 18:10

1 Answers1

3

This happens because set()'s hashmap relies on Python's builtin hashing, which uses a random seed by default as of 3.3 .. this results in a different ordering between runs, and so a different string being hashed by md5 (or any other hashing algorithm you pass it to)

This is intended to provide protection against a denial-of-service caused by carefully chosen inputs that exploit the worst case performance of a dict insertion, O(n2) complexity.

from the object.__hash__ method docs

ti7
  • 16,375
  • 6
  • 40
  • 68