7

[Python 3.1]

I am trying to create a hash for a container that may have nested containers in it, with unknown depth. At all levels of the hierarchy, there are only built-in types. What's a good way to do that?

Why I need it:

I am caching the result of some calculations in a pickle object (on disk). I would need to invalidate that cached file if the function is called with different parameters (this happens rarely, so I'm not going to save more than one file to disk). The hash will be used to compare the parameters.

pyfunc
  • 65,343
  • 15
  • 148
  • 136
max
  • 49,282
  • 56
  • 208
  • 355
  • Are you expecting these values to be able to change after you create them? – aaronasterling Nov 17 '10 at 06:39
  • @aaronsterling: Can you clarify please? I'm not sure I'm answering your question, but the containers or their contents will not be modified. However, I have to emphasize that the persistence is, of course, required for any future invocation of this Python program (in a new process). – max Nov 17 '10 at 06:45
  • So you're OK with a small chance that the file might not be invalidated by different function parameters? Any hash function will have a chance of collisions, so you're looking for one with a minimal collision chance. Just trying to understand the problem a bit better. – Scott Griffiths Nov 17 '10 at 17:23
  • @Scott Griffiths Yes I'm OK with a small chance of this problem happening. Although I guess this suggests serialization is a safer approach than hash when the parameters are not huge. – max Nov 18 '10 at 00:36
  • Related: https://stackoverflow.com/questions/64344515/python-consistent-hash-replacement – Albert Sep 19 '22 at 07:43

3 Answers3

2

If all the containers are tuples, and all the contained objects are hashable, then the main container should be hashable.

Paul
  • 42,322
  • 15
  • 106
  • 123
1

You could just serialize the parameters into something like JSON, and use that for the hash.

Gerrat
  • 28,863
  • 9
  • 73
  • 101
  • I like this idea. Are there any hidden problems with that? E.g., would it break on some complicated nesting structures, etc? – max Nov 18 '10 at 00:39
  • No, so long as it's just tuples, lists, dicts, strs and ints you're fine with it. One caveat, however: `tuple` == `list` (as JavaScript doesn't have a `tuple` concept) – Chris Morgan Nov 18 '10 at 00:57
  • 2
    I think I'd use `pickle.dumps` instead of `json.dumps`; the pickle is precise with types and ~15% faster (I timed them). – Chris Morgan Nov 18 '10 at 01:01
  • 1
    @ChrisMorgan I just realized I asked a [related question](http://stackoverflow.com/questions/12727571/using-pickle-dumps-to-hash-mutable-objects) 2 years later... And a comment and an answer to it said that I should *not* use `pickle` because it's both unpredictable for non-sorted objects and slower than manual conversion to hashable objects. One day I'll time different approaches. – max Oct 31 '12 at 09:22
0

I would do it with json serialization as a string [and then hash that string if it's still necessary].

from simplejson import dumps

def hash_data(data):
    return hash(dumps(data))
vonPetrushev
  • 5,457
  • 6
  • 39
  • 51