4

Python 3's default hashing function(s) isn't deterministic (hash(None) varies from run to run), and doesn't even make a best effort to generate unique id's with high probability (hash(-1)==hash(-2) is True).

Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?

Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.

I found the cryptographic hashes (md5,sha256) in the standard library, but they only operate on bytestrings.

Haskell seems to get this ~almost right in their standard library... but "Nothing::Maybe Int" and 0 both hash to 0, so it's not perfect there either.

Andrew Wagner
  • 22,677
  • 21
  • 86
  • 100
  • In Python 2, hashing is deterministic. It was made to introduce some random factor after a security exploit using many known hash collisions, for example to a server, as a denial of service attack. See [here](http://www.ocert.org/advisories/ocert-2011-003.html) – Artyer Jul 10 '17 at 14:50
  • 1
    Hmm. `hash(-1)` being `-2` is really strange. From -100,000,000 to 100,000,000, that's the only int whose hash value differs from itself. Weird. I wonder how/why they did that. – Stefan Pochmann Jul 10 '17 at 14:54
  • 1
    @StefanPochmann In the CPython source, returning `-1` is reserved for errors, so `-1` is converted to `-2` for returning the hash. – Artyer Jul 10 '17 at 14:55
  • 1
    @Artyer Yeah, I just found https://stackoverflow.com/questions/10130454/why-do-1-and-2-both-hash-to-2-in-cpython – Stefan Pochmann Jul 10 '17 at 14:56
  • @Stefan, thanks for finding the reason. It sounds pretty silly to break hash, but I'm not familiar with the cPython code. Seems like at the C level, hash should be using a function argument, not the return value, to return the result. At least they didn't apply this logic to the math functions! – Andrew Wagner Jul 10 '17 at 15:28
  • I thought `pprint` would help, but then realized it fails for sets. Turned that into a [new question](https://stackoverflow.com/questions/45016455/pprint-sorting-dicts-but-not-sets). – Stefan Pochmann Jul 10 '17 at 16:03

1 Answers1

6

You can use any hash from hashlib on a pickled object. pickle.dumps not suitable for hashing.

You can use sorted-keys json with hashlib.

hashlib.md5(json.dumps(data, sort_keys=True)).hexdigest()

Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.

By the way and only for reference as this causes security vulnerabitilies, the PYTHONHASHSEED environment variable can be used to disable randomization of hashes throughout your application.

nitzpo
  • 497
  • 2
  • 6
  • Thanks! I'll try it out. Is pickle a pure function of your data structure? Or is it sticking python version strings, timestamps, etc... in the serialized data? – Andrew Wagner Jul 10 '17 at 15:02
  • 1
    I get different results for `hashlib.md5(pickle.dumps({0, 8})).hexdigest()` and `hashlib.md5(pickle.dumps({8, 0})).hexdigest()`. – Stefan Pochmann Jul 10 '17 at 15:05
  • Yes I tested it now too. The problem is that the pickle output itself changes because of the hash randomization. You could use `json.dumps` if it suits your needs; it supports dicts but not sets. – nitzpo Jul 10 '17 at 15:07
  • @nitzpo Are you sure the hash randomization is the reason? Shouldn't that only apply to different runs of the Python script? I get a difference even if I do them in a single expression: `pickle.dumps({0, 8}) == pickle.dumps({8, 0})` gives me `False`. – Stefan Pochmann Jul 10 '17 at 15:10
  • Ok now I tried `json`... for `json.dumps({0})` I get `TypeError: set([0]) is not JSON serializable` and for `json.dumps({0: 0, 8: 8}) == json.dumps({8: 8, 0: 0})` I get `False`. – Stefan Pochmann Jul 10 '17 at 15:15
  • @Stefan, I'm not surprised dictionary types don't work when you change the order; I was expecting to at least need to use OrderedDict or FrozenDict or namedtuple for key,value stuff. – Andrew Wagner Jul 10 '17 at 15:31
  • This related question also uses json, which supports NaN's, and sorting dictionary entries by key. tuples and lists are serialized the same way. – Andrew Wagner Jul 10 '17 at 16:05
  • json also treats namedtuples as tuples; sort_keys seemingly has no effect, and the type of the namedtuple doesn't get serialized... So there are some issues to be aware of with collisions between instances of several built-in / standard library types... – Andrew Wagner Jul 10 '17 at 16:13
  • 1
    https://stackoverflow.com/questions/5417949/computing-an-md5-hash-of-a-data-structure – Andrew Wagner Jul 10 '17 at 16:14
  • I like the solution from @AndrewWagner 's link with the `json.dumps(data, sort_keys=True)` – nitzpo Jul 10 '17 at 19:52
  • `sort_keys=True` kicked my ass, especially since `pprint` sorts `dicts` before printing them. Learning is fun. – Nathan Chappell May 21 '20 at 08:28