
I've been using pickle.dumps to create a hash for an arbitrary Python object. However, I've found that dict/set iteration orders aren't canonicalized, so the result is unreliable.
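The problem is easy to reproduce (Python 3 shown, where dicts iterate in insertion order): two equal dicts built in different orders serialize to different byte strings, so any hash of those bytes differs too.

```python
import pickle

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}  # equal dict, but different insertion order

assert a == b                             # equal as values...
assert pickle.dumps(a) != pickle.dumps(b) # ...yet the pickles differ
```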

There are several related questions on SO and elsewhere, but I can't seem to find a hashing algorithm that uses the same basis for equality (__getstate__/__dict__ results). I understand the basic requirements for rolling my own, but obviously I'd much prefer to use something that's been tested.

Does such a library exist? I suppose what I'm actually asking for is a library that serializes objects deterministically (using __getstate__ and __dict__) so that I can hash the output.

EDIT

To clarify, I'm looking for something different than the values returned by Python's hash (or __hash__). What I want is essentially a checksum for arbitrary objects which may or may not be hashable. This value should vary based on objects' state. (I'm using "state" to refer to the dict returned by __getstate__ or, if that's not present, the object's __dict__.)

matthewwithanm
    It may be useful to understand what you want the hash for. – arikb Apr 23 '13 at 02:40
  • It's essentially for memoization. I'm generating files using a set of parameters from which the filenames need to be determined. However, I'm interested in a general solution to the problem. – matthewwithanm Apr 23 '13 at 16:34
  • Note that you can tell apart two "equal" objects by the iteration order of `__dict__`! – tc. Apr 23 '13 at 23:45

2 Answers


It occurred to me that Pickler can be extended, with selected save methods overridden to canonicalize the necessary types, so that's what I'm doing. Here's what it looks like:

# Python 2 -- relies on dict.iteritems and types.DictionaryType
from copy import copy
from pickle import Pickler, MARK, DICT
from types import DictionaryType


class CanonicalizingPickler(Pickler):
    # Copy the dispatch table so the base Pickler is left untouched
    dispatch = copy(Pickler.dispatch)

    def save_set(self, obj):
        # Reduce the set to (set, (elements,), state) and sort the elements
        rv = obj.__reduce_ex__(0)
        rv = (rv[0], (sorted(rv[1][0]),), rv[2])
        self.save_reduce(obj=obj, *rv)

    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit the protocol-0 dict opcodes directly, items in sorted order
        write = self.write
        write(MARK + DICT)

        self.memoize(obj)
        self._batch_setitems(sorted(obj.iteritems()))

    dispatch[DictionaryType] = save_dict
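For Python 3, the same idea can be sketched against the pure-Python pickle._Pickler. Note this is a private implementation detail that may change between versions, and sorting by repr() is just one assumption for handling unorderable keys/elements; stable_hash is a hypothetical helper name.

```python
import hashlib
import io
import pickle


class CanonicalizingPickler(pickle._Pickler):
    """Python 3 sketch: pickle with canonical dict/set ordering."""
    dispatch = pickle._Pickler.dispatch.copy()

    def save_set(self, obj):
        # Reduce the set to set(sorted_elements); sorting by repr()
        # sidesteps unorderable element types.
        self.save_reduce(set, (sorted(obj, key=repr),), obj=obj)

    dispatch[set] = save_set

    def save_dict(self, obj):
        # Emit protocol-0 dict opcodes with items in sorted key order.
        self.write(pickle.MARK + pickle.DICT)
        self.memoize(obj)
        for key in sorted(obj, key=repr):
            self.save(key)
            self.save(obj[key])
            self.write(pickle.SETITEM)

    dispatch[dict] = save_dict


def stable_hash(obj):
    buf = io.BytesIO()
    CanonicalizingPickler(buf, protocol=0).dump(obj)
    return hashlib.sha256(buf.getvalue()).hexdigest()
```

With this, equal dicts and sets produce identical digests regardless of insertion order, including when nested.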
matthewwithanm

I'm going to assume you're going to treat your objects as immutable once you've computed (and stored) their hash values. Otherwise, you should be very careful with what you are doing (e.g. you shouldn't rely on their hashability to store them in sets, dicts, etc.).

That said, the most elegant way is to store all the members of your object's __dict__ as hashable types in the first place. Instead of lists, use tuples (of hashable objects, of course). Instead of dicts, use any of the solutions to this question as your hashable-dict type (I'm personally using @alex's). Again, both the keys and the values must be hashable for this to work.

Then, your __hash__ method can use the same hashable-dict you're using, like:

def _hashable_state(self):
    return MyHashableDict(self.__dict__)
def __hash__(self):
    return hash(self._hashable_state())
def __reduce__(self):
    return self._hashable_state()
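As a concrete sketch of this pattern, assuming a plain frozenset of __dict__ items as the hashable-dict stand-in (rather than a full hashable-dict class) and a hypothetical Point class:

```python
class Point:
    """Hypothetical immutable value object."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def _hashable_state(self):
        # A frozenset of items is order-independent and hashable,
        # provided every key and value is itself hashable.
        return frozenset(self.__dict__.items())

    def __eq__(self, other):
        return isinstance(other, Point) and self.__dict__ == other.__dict__

    def __hash__(self):
        return hash(self._hashable_state())
```

Two Points with the same state then hash (and compare) equal, which is the contract __hash__ requires.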
shx2
  • Yes, you're right about treating my objects as immutable. However I'm looking for something that will work for arbitrary (pickleable) objects. That is, two objects should have the same hash iff all the items in their state dictionary (`__dict__` or `__getstate__()` result) have the same hash (the recursive case), or (for primitives or other objects without `__dict__` properties or `__getstate__`) are equal. – matthewwithanm Apr 23 '13 at 16:50