
As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is that I'd like to have a better idea about how hygienic macros would work in an algol-like language (as opposed to the syntax-free lisp dialects you normally find them in). Because of this, different passes through the input might see different grammars, so cached parse results are invalid unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose the interface to allow them to be changed, so either mutable or immutable collections are fine.)

The problem is that python dicts cannot appear as keys to other dicts. Even using a tuple (as I'd be doing anyways) doesn't help.

>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> 

I guess it has to be tuples all the way down. The python standard library provides approximately what I'd need: collections.namedtuple has a very different syntax, but can be used as a key. Continuing from the above session:

>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}

Ok. But I have to make a class for each possible combination of keys in the rule I would want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so that class can be defined at the same time as the function that parses the rule.
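That pattern -- a namedtuple key class defined alongside each rule's parse function -- might look something like this sketch (the names `LitRule` and `parse_literal` are illustrative, not from my actual parser):

```python
from collections import namedtuple

# each parse rule declares its own key class next to the function that parses it
LitRule = namedtuple("LitRule", ["text"])

def parse_literal(rule, s, pos, cache):
    key = (LitRule(**rule), pos)
    if key in cache:
        return cache[key]
    # result is the new position on a match, or None on failure
    result = pos + len(rule["text"]) if s.startswith(rule["text"], pos) else None
    cache[key] = result
    return result

cache = {}
assert parse_literal({"text": "foo"}, "foobar", 0, cache) == 3
assert (LitRule(text="foo"), 0) in cache  # memoized under a hashable key
```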

Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:

>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False

tl;dr: How do I get dicts that can be used as keys to other dicts?

Having hacked a bit on the answers, here's the more complete solution I'm using. Note that this does a bit of extra work to make the resulting dicts vaguely immutable for practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value), but we're all adults here.

class hashdict(dict):
    """
    hashable dict implementation, suitable for use as a key into
    other dicts.

        >>> h1 = hashdict({"apples": 1, "bananas":2})
        >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        hashdict(apples=1, bananas=3, mangoes=5)
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: hashdict(bananas=3, mangoes=5)

    based on answers from
       http://stackoverflow.com/questions/1151658/python-hashable-dicts

    """
    def __key(self):
        return tuple(sorted(self.items()))
    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__,
            ", ".join("{0}={1}".format(
                    str(i[0]),repr(i[1])) for i in self.__key()))

    def __hash__(self):
        return hash(self.__key())
    def __setitem__(self, key, value):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def __delitem__(self, key):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def clear(self):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def pop(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def popitem(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def setdefault(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def update(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    # update is not ok because it mutates the object
    # __add__ is ok because it creates a new object
    # while the new object is under construction, it's ok to mutate it
    def __add__(self, right):
        result = hashdict(self)
        dict.update(result, right)
        return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()
max
SingleNegationElimination
  • The `hashdict` must be immutable, at least after you start hashing it, so why not cache the `key` and `hash` values as attributes of the `hashdict` object? I modified `__key()` and `__hash__()`, and tested to confirm that it is much faster. SO doesn't allow formatted code in comments, so I'll link it here: http://sam.aiki.info/hashdict.py – Sam Watkins Nov 21 '17 at 08:35
  • I do not know if this is maybe too simple, but would hashing the dict with `str()` maybe work? I.e. in your example: `cache[(str(rule), "baz")] = "quux"` – Konstantin Mar 19 '21 at 14:35
  • @Konstantin it needs to be sorted – Erik Aronesty Nov 02 '22 at 13:45

11 Answers

102

Here is the easy way to make a hashable dictionary. Just remember not to mutate them after embedding in another dictionary for obvious reasons.

class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
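For what it's worth, a quick sketch of this class in the question's cache scenario (the `cache` and `rule` names follow the question, not this answer):

```python
class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

# use it as a key exactly like the question's (rule, position) tuples
cache = {}
rule = hashabledict({"foo": "bar"})
cache[(rule, "baz")] = "quux"

# an equal dict built independently finds the same entry
assert cache[(hashabledict({"foo": "bar"}), "baz")] == "quux"
```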
Unknown
  • This does not sharply ensure consistency of __eq__ and __hash__ while my earlier answer does through the use of the __key method (in practice either approach should work, though this one might be slowed down by making an unneeded intermediate list -- fixable by s/items/iteritems/ -- assuming Python 2.* as you don't say;-). – Alex Martelli Jul 20 '09 at 04:48
  • @Alex, yes you can fix it with iteritems. I copied and pasted this solution from a google link. As for ensuring consistency of __hash__, there should be no problems. Equality is also not a problem. hashabledict(a=5) == hashabledict(a=5) is true. – Unknown Jul 20 '09 at 05:01
  • Both solutions seem to be about the same, and this is probably the kernel of how I will solve the problem, so I'm accepting yours since you have a lower rep. – SingleNegationElimination Jul 20 '09 at 18:16
  • It would probably be better to just use a frozenset rather than a tuple with sorting. Not only would this be faster, but you can't assume that dictionary keys are comparable. – asmeurer Aug 25 '12 at 01:14
  • It seems there should be a way to avoid a hash function that's `O(n*log(n))` where `n` is the number of `dict` entries. Does anyone know if Python's `frozenset` hash function runs in linear time? – Tom Karzes Jun 27 '16 at 14:41
  • A normal dict is created like so: `{key_a: val_a, key_b: val_b, ...}`. How would you go ahead creating a `hashabledict`? Could you add an example? – HelloGoodbye Oct 04 '18 at 07:44
  • @HelloGoodbye A dict can also be created like this `dict(key1=value1, key2=value2,...)` or this `dict([(key1, value1), (key2, value2),...])`. The same applies to this one. The creation you posted is called a *literal* – smido Oct 06 '18 at 20:09
  • @smido: Thanks. I also found that you can just cast a literal, i.e. `hashabledict({key_a: val_a, key_b: val_b, ...})`. – HelloGoodbye Oct 09 '18 at 07:35
  • Python dictionaries now keep order. The code in this answer considers two dictionaries containing the exact same items but in a different order to be equivalent. Remove the call to `sorted()` if that's not what you want. – Boris Verkhovskiy May 02 '19 at 03:31
  • @BorisVerkhovskiy: Python's `==` operator considers dictionaries with identical contents in a different order to be equivalent, so if you want them to be different I think you also need to implement a different equality method... – psmears Sep 16 '22 at 21:36
  • frozenset is a linear hash: https://stackoverflow.com/questions/20832279/python-frozenset-hashing-algorithm-implementation – Erik Aronesty Nov 02 '22 at 13:49
  • Would it help to override `__setitem__`, `__setslice__`, `__del__`, `__delete__`, `__delslice__` (and maybe others) to raise errors instead of trust that it won't be modified? – dfrankow Apr 12 '23 at 21:37
81

Hashables should be immutable -- not enforcing this but TRUSTING you not to mutate a dict after its first use as a key, the following approach would work:

class hashabledict(dict):
  def __key(self):
    return tuple((k,self[k]) for k in sorted(self))
  def __hash__(self):
    return hash(self.__key())
  def __eq__(self, other):
    return self.__key() == other.__key()

If you DO need to mutate your dicts and STILL want to use them as keys, complexity explodes hundredfolds -- not to say it can't be done, but I'll wait until a VERY specific indication before I get into THAT incredible morass!-)

Alex Martelli
  • I certainly do not want to mutate the dicts once they have been prepared. That would make the rest of the packrat algorithm fall apart. – SingleNegationElimination Jul 20 '09 at 04:21
  • Then the subclass I suggested will work -- note how it bypasses the "positional" issue (_before_ you had edited your question to point it out;-) with the `sorted` in __key;-). – Alex Martelli Jul 20 '09 at 04:44
  • The position dependent behavior of namedtuple surprised the heck out of me. I had been playing with it, thinking it might still be an easier way to solve the problem, but that pretty much dashed all my hopes (and will require a rollback :( ) – SingleNegationElimination Jul 20 '09 at 18:15
  • Let's say I have a dict and I want to cast it to a hashabledict. How would I do that? – tadasajon Nov 12 '14 at 22:02
  • @JonCrowell see these questions for ideas and clarifications: http://stackoverflow.com/questions/3464061/cast-base-class-to-derived-class-python-or-more-pythonic-way-of-extending-class, http://stackoverflow.com/questions/9112300/how-to-cast-object-in-python, http://stackoverflow.com/questions/18020074/convert-a-baseclass-object-into-a-subclass-object-idiomatically – max Apr 13 '15 at 09:28
  • @AlexMartelli Alex what is the benefit of redefining `__eq__` in the subclass? It would seem that `dict.__eq__` would always return exactly the same value as `hashabledict.__eq__`, while being faster due to the C implementation. What am I missing? I know the docs state that one should define `__eq__` whenever defining `__hash__`, but I don't see why, in this case. (python 3.4 assumed) – max Apr 13 '15 at 10:41
  • @max, I was just following the canonical approach which as you mention is mandated by the docs -- you're probably right that in this case it's not necessary (but of course I couldn't assume py 3.4 back in '09 when I wrote this answer, right?-). Nowadays I'd probably use `collections.Mapping` delegating as little as I can get away with (compatibly with acceptable performance) to an internally-contained `dict`. – Alex Martelli Apr 17 '15 at 10:32
  • It seems there should be a way to avoid a hash function that's `O(n*log(n))` where `n` is the number of `dict` entries. Does anyone know if Python's `frozenset` hash function runs in linear time? – Tom Karzes Jun 27 '16 at 14:41
  • Your __eq__ breaks everything. Unequal dictionaries can now compare equal. Equality with any other type will now raise an exception. – OrangeDog Jan 25 '18 at 18:45
  • @OrangeDog what's an example of it comparing equal to an unequal dictionary? As I read it the `__eq__` defined here will only match on a sorted tuple of (key, value) tuples in this `dict`. It's not relevant that the key is sorted for equality though, that's only for the hashing to be (_almost_ always) unique. – Davos Dec 18 '18 at 01:22
43

All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:

class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))

Note, the frozenset conversion will work for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.
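The point about unsortable keys can be sketched concretely (illustrative, on Python 3, where mixed key types are not comparable and the `tuple(sorted(...))` approach would fail):

```python
class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))

d = Hashabledict({"a": 1, 1: "a"})  # str and int keys cannot be sorted together
h = hash(d)                         # frozenset hashing does not need them sorted
assert isinstance(h, int)

try:
    sorted(d)                       # the sorting approach raises here
except TypeError:
    pass
else:
    raise AssertionError("expected mixed keys to be unsortable")
```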

If there are many dictionaries with identical keys but with distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:

class Hashabledict(dict):
    def __hash__(self):
        return hash((frozenset(self), frozenset(self.itervalues())))

This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues accesses the values directly, avoiding the many memory-allocator calls that items makes to form many new key/value tuples in memory every time you do a lookup.

Raymond Hettinger
  • @RaymondHettinger Correct me if I'm wrong, but I thought `dict` itself doesn't cache its keys' hash values - although individual classes (like `str`) may and do choose to cache their hashes. At least when I created a `dict` with my custom class instances used as keys, their `__hash__` methods were called on every access operation (python 3.4). Whether or not I'm correct, I am not sure how `hash(frozenset(self))` can reuse the pre-computed hash values, unless they are cached inside the keys themselves (in which case, `hash(frozenset(self.items())` reuses them as well). – max Apr 13 '15 at 09:17
  • As to your second point about (key/value) tuple creation, I thought .items() methods returns a view rather than a list of tuples, and that the creation of that view does not involve copying the underlying keys and values. (Python 3.4 again.) That said, I do see the advantage of hashing just the keys if most inputs have different keys - because (1) it's quite expensive to hash values, and (2) it's quite restrictive to require that values are hashable – max Apr 13 '15 at 10:49
  • This also has the possibility of creating the same hash for two different dictionaries. Consider `{'one': 1, 'two': 2}` and `{'one': 2, 'two': 1}` – AgDude Sep 10 '15 at 13:24
  • Mike Graham in his [comment](http://stackoverflow.com/questions/1151658/python-hashable-dicts#comment44936774_2705638) states that *Deriving dict for any other reason but to define `__missing__` is a bad idea.* What do you think? – Piotr Dobrogost Mar 15 '16 at 12:14
  • Subclassing from dict has been well defined since Python 2.2. See collections.OrderedDict and collections.Counter for examples from the Python standard library. The other comment is based on the unfounded belief that only subclasses of MutableMapping are well defined. – Raymond Hettinger Mar 16 '16 at 16:04
  • @AgDude the possibility of two dictionaries having the same hash is OK. It has been explained before the second solution – Nwawel A Iroume Aug 06 '20 at 17:00
26

The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:

>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023

The performance advantage depends on the content of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it does not need to sort).
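On Python 3 (where `iteritems` is gone) the same comparison can be sketched like this; absolute timings depend on the machine, so no expected output is shown:

```python
import timeit

setup = "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')"
t_sorted = timeit.timeit("hash(tuple(sorted(d.items())))", setup, number=100000)
t_frozen = timeit.timeit("hash(frozenset(d.items()))", setup, number=100000)
print(t_sorted, t_frozen)  # timings vary; frozenset skips the sort entirely
```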

Oben Sonne
  • Note, there is no need to include both the keys and the values. This solution would be *much* faster as: ``hash(frozenset(d))``. – Raymond Hettinger Apr 23 '13 at 06:09
  • @RaymondHettinger: `hash(frozenset(d))` results in identical hashes for 2 dicts with similar keys but different values! – Oben Sonne Apr 23 '13 at 08:45
  • That isn't a problem. It is the job of \_\_eq\_\_ to distinguish between dicts of differing values. The job of \_\_hash\_\_ is merely to reduce the search space. – Raymond Hettinger Apr 23 '13 at 21:10
  • That's true for the theoretical concept of hashes and mappings but not practical for caches with dictionaries as lookups -- it is not uncommon that dictionaries with similar keys but different values are passed to a mem-cached function. In that case the cache practically turns into a list instead of a mapping if only the keys are used for building a hash. – Oben Sonne Apr 26 '13 at 13:05
  • In the special case of dicts with identical keys and distinct values, you would be better-off just storing a hash based on ``frozenset(d.itervalues())``. In cases where dicts have distinct keys, ``frozenset(d)`` is *much* faster and imposes no restrictions on the hashability of keys. Lastly, remember that the dict.\_\_eq\_\_ method will check for equal key/value pairs much quicker than anything can compute the hash for all the key/value pair tuples. Using key/value tuples is also problematic because it throws away the stored hashes for all the keys (that is why ``frozenset(d)`` is so fast). – Raymond Hettinger Apr 29 '13 at 05:55
  • See my solution for a way to take values into account without calling *iteritems*, which throws away the stored hash values for each key and unnecessarily fills memory with new key/value tuples on *every* lookup. – Raymond Hettinger Apr 29 '13 at 06:13
15

A reasonably clean, straightforward implementation is

import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        return hash(tuple(sorted(self._d.iteritems())))
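A quick usage sketch of the same design on Python 3 (where the base class lives at `collections.abc.Mapping` and `iteritems` becomes `items()`); note that `Mapping` supplies `__eq__`, `get`, `keys`, `items`, and friends for free:

```python
from collections.abc import Mapping

class FrozenDict(Mapping):
    """Immutable mapping backed by a private dict."""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        return hash(tuple(sorted(self._d.items())))

fd = FrozenDict(a=1, b=2)
assert fd == {"a": 1, "b": 2}                      # __eq__ comes from Mapping
assert {fd: "hit"}[FrozenDict(b=2, a=1)] == "hit"  # usable as a dict key
```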
Mike Graham
  • Why is it so reasonable, clean and straightforward? I.e. please explain the differences to other answers, e.g. the necessity of `__iter__` and `__len__`. – Kalle Richter Dec 31 '14 at 01:18
  • @KarlRichter, I never said it was reasonable, just reasonably clean. ;) – Mike Graham Feb 03 '15 at 05:06
  • @KarlRichter, I define `__iter__` and `__len__` because I have to, since I'm deriving `collections.Mapping`; how to use `collections.Mapping` is covered pretty well in the collections module documentation. Other people don't feel the need to since they're deriving `dict`. Deriving `dict` for any other reason but to define `__missing__` is a bad idea. The dict spec doesn't say how dict works in such a case, and in reality this will end up having tons of non-virtual methods that are less useful in general and in this particular case will have vestigial methods with irrelevant behavior. – Mike Graham Feb 03 '15 at 05:10
7

I keep coming back to this topic... Here's another variation. I'm uneasy with subclassing dict to add a __hash__ method; there's virtually no escape from the problem that dicts are mutable, and trusting that they won't change seems like a weak idea. So I've instead looked at building a mapping based on a builtin type that is itself immutable. Although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to be leveraging much of the power of the type it's built on.

What if you jam key, value pairs into a frozenset? What would that require, how would it work?

Part 1, you need a way of encoding the 'items' in such a way that a frozenset will treat them mainly by their keys; I'll make a little subclass for that.

import collections
class pair(collections.namedtuple('pair_base', 'key value')):
    def __hash__(self):
        return hash((self.key, None))
    def __eq__(self, other):
        if type(self) != type(other):
            return NotImplemented
        return self.key == other.key
    def __repr__(self):
        return repr((self.key, self.value))

That alone puts you in spitting distance of an immutable mapping:

>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])

D'oh! Unfortunately, when you use the set operators on elements that are equal but not the same object, which one ends up in the return value is undefined; we'll have to go through some more gyrations.

>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'

However, looking values up in this way is cumbersome, and worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key-value pair to get around it:

class Thief(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(pair(self.key, None))
    def __eq__(self, other):
        self.value = other.value
        return pair(self.key, None) == other

Which results in the less problematic:

>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'

That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing from frozenset, which has a very different interface, there's quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods for versions that work like dicts, instead:

class FrozenDict(frozenset, collections.Mapping):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
    def __getitem__(self, key):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        raise KeyError(key)
    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return dict(self.iteritems()) == other
        if len(self) != len(other):
            return False
        for key, value in self.iteritems():
            try:
                if value != other[key]:
                    return False
            except KeyError:
                return False
        return True
    def __hash__(self):
        return hash(frozenset(self.iteritems()))
    def get(self, key, default=None):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        return default
    def __iter__(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def iteritems(self):
        for item in frozenset.__iter__(self):
            yield (item.key, item.value)
    def iterkeys(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def itervalues(self):
        for item in frozenset.__iter__(self):
            yield item.value
    def __contains__(self, key):
        return frozenset.__contains__(self, pair(key, None))
    has_key = __contains__
    def __repr__(self):
        return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
    @classmethod
    def fromkeys(cls, keys, value=None):
        return cls((key, value) for key in keys)

which, ultimately, does answer my own question:

>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5
SingleNegationElimination
6

The accepted answer by @Unknown, as well as the answer by @AlexMartelli work perfectly fine, but only under the following constraints:

  1. The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise TypeError.
  2. Keys must support comparison operation. For example, hash(hashabledict({'a':'a', 1:1})) will raise TypeError.
  3. The comparison operator on keys imposes total ordering. For example, if the two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Therefore, sorting the items of a dictionary with such keys can result in an arbitrary order, and therefore will violate the rule that equal objects must have the same hash value.
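Constraint 3 can be sketched concretely: frozensets compare with subset semantics, so two disjoint frozensets are unordered with respect to each other, and sorting (which is stable) just preserves whatever order the items arrived in (illustrative, Python 3):

```python
a, b = frozenset((1, 2, 3)), frozenset((4, 5, 6))
# neither is a subset of the other, so both comparisons are False
assert not a < b and not b < a

# sort is stable, so the "sorted" order is just the input order
t1 = tuple(sorted([(a, 1), (b, 2)]))
t2 = tuple(sorted([(b, 2), (a, 1)]))
assert t1 != t2  # equal dicts could therefore produce different hashes
```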

The much faster answer by @ObenSonne lifts the constraints 2 and 3, but is still bound by constraint 1 (values must be hashable).

The faster yet answer by @RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if:

  1. Most of the (non-equal) dictionaries that need to be hashed do not have identical .keys().

If this condition isn't satisfied, the hash function will still be valid, but may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys will always be the same, and the hash function will return the same value for all the inputs. As a result, a hashtable that relies on such a hash function will become as slow as a list when retrieving an item (O(N) instead of O(1)).
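A sketch of that collision scenario (the class name is illustrative; it's the keys-only hashing approach from the answer being discussed):

```python
class KeysOnlyDict(dict):
    def __hash__(self):
        return hash(frozenset(self))  # hashes keys only, ignores values

d1 = KeysOnlyDict(name="alice", age=30)
d2 = KeysOnlyDict(name="bob", age=25)
assert hash(d1) == hash(d2)  # same keys -> same hash
assert d1 != d2              # __eq__ still distinguishes them, but every
                             # lookup in a keys-identical workload collides
```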

I think the following solution will work reasonably well even if all 4 constraints I listed above are violated. It has an additional advantage that it can hash not only dictionaries, but any containers, even if they have nested mutable containers.

I'd much appreciate any feedback on this, since I only tested this lightly so far.

# python 3.4
import collections
import operator
import sys
import itertools
import reprlib

# a wrapper to make an object hashable, while preserving equality
class AutoHash:
    # for each known container type, we can optionally provide a tuple
    # specifying: type, transform, aggregator
    # even immutable types need to be included, since their items
    # may make them unhashable

    # transformation may be used to enforce the desired iteration
    # the result of a transformation must be an iterable
    # default: no change; for dictionaries, we use .items() to see values

    # usually transformation choice only affects efficiency, not correctness

    # aggregator is the function that combines all items into one object
    # default: frozenset; for ordered containers, we can use tuple

    # aggregator choice affects both efficiency and correctness
    # e.g., using a tuple aggregator for a set is incorrect,
    # since identical sets may end up with different hash values
    # frozenset is safe since at worst it just causes more collisions
    # unfortunately, no collections.ABC class is available that helps
    # distinguish ordered from unordered containers
    # so we need to just list them out manually as needed

    type_info = collections.namedtuple(
        'type_info',
        'type transformation aggregator')

    ident = lambda x: x
    # order matters; first match is used to handle a datatype
    known_types = (
        # dict also handles defaultdict
        type_info(dict, lambda d: d.items(), frozenset), 
        # no need to include set and frozenset, since they are fine with defaults
        type_info(collections.OrderedDict, ident, tuple),
        type_info(list, ident, tuple),
        type_info(tuple, ident, tuple),
        type_info(collections.deque, ident, tuple),
        type_info(collections.Iterable, ident, frozenset) # other iterables
    )

    # hash_func can be set to replace the built-in hash function
    # cache can be turned on; if it is, cycles will be detected,
    # otherwise cycles in a data structure will cause failure
    def __init__(self, data, hash_func=hash, cache=False, verbose=False):
        self._data=data
        self.hash_func=hash_func
        self.verbose=verbose
        self.cache=cache
        # cache objects' hashes for performance and to deal with cycles
        if self.cache:
            self.seen={}

    def hash_ex(self, o):
        # note: isinstance(o, Hashable) won't check inner types
        try:
            if self.verbose:
                print(type(o),
                    reprlib.repr(o),
                    self.hash_func(o),
                    file=sys.stderr)
            return self.hash_func(o)
        except TypeError:
            pass

        # we let built-in hash decide if the hash value is worth caching
        # so we don't cache the built-in hash results
        if self.cache and id(o) in self.seen:
            return self.seen[id(o)][0] # found in cache

        # check if o can be handled by decomposing it into components
        for typ, transformation, aggregator in AutoHash.known_types:
            if isinstance(o, typ):
                # another option is:
                # result = reduce(operator.xor, map(_hash_ex, handler(o)))
                # but collisions are more likely with xor than with frozenset
                # e.g. hash_ex([1,2,3,4])==0 with xor

                try:
                    # try to frozenset the actual components, it's faster
                    h = self.hash_func(aggregator(transformation(o)))
                except TypeError:
                    # components not hashable with built-in;
                    # apply our extended hash function to them
                    h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                if self.cache:
                    # storing the object too, otherwise memory location will be reused
                    self.seen[id(o)] = (h, o)
                if self.verbose:
                    print(type(o), reprlib.repr(o), h, file=sys.stderr)
                return h

        raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

    def __hash__(self):
        return self.hash_ex(self._data)

    def __eq__(self, other):
        # short circuit to save time
        if self is other:
            return True

        # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
        # 2) any other situation => lhs.__eq__ will be called first

        # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
        # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
        # case 2. neither side is a subclass of the other; self is lhs
        # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
        # case 3. neither side is a subclass of the other; self is rhs
        # => we can't compare to another type, and the other side already tried and failed;
        # we should return False, but NotImplemented will have the same effect
        # any other case: we won't reach the __eq__ code in this class, no need to worry about it

        if isinstance(self, type(other)): # identifies case 1
            return self._data == other._data
        else: # identifies cases 2 and 3
            return NotImplemented

d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))

d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))
max
2

You might also want to add these two methods to get the v2 pickling protocol to work with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two versions of the protocol your code works just fine.

def __setstate__(self, objstate):
    for k,v in objstate.items():
        dict.__setitem__(self,k,v)
def __reduce__(self):
    return (hashdict, (), dict(self),)
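A sketch of the round-trip this enables, using a minimal `hashdict` (just `__hash__`, a blocked `__setitem__`, and the two methods above) rather than the full class from the question:

```python
import pickle

class hashdict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
    def __setitem__(self, key, value):
        raise TypeError("hashdict does not support item assignment")
    # pickle support: bypass the blocked __setitem__ on restore
    def __setstate__(self, objstate):
        for k, v in objstate.items():
            dict.__setitem__(self, k, v)
    def __reduce__(self):
        return (hashdict, (), dict(self))

h = hashdict({"apples": 1, "bananas": 2})
h2 = pickle.loads(pickle.dumps(h, protocol=2))
assert h2 == h and hash(h2) == hash(h)
```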
Giovanni
0

This is the correct way with python3:

    class KeyDict(dict):
        def __hash__(self):
            return hash(frozenset(self.items()))

The accepted answer performs sorting, which is slower than using a set, and the other popular answer with frozenset(self) is incorrect (two different dicts can feed the same input -- the keys alone -- to the hash). In practice, this can result in excessive collisions, reducing your lookups to O(n) for certain cases.

The solution above is mentioned in a comment, but should be the main answer.

Erik Aronesty
-1

Serialize the dict as a string with the json package:

import json

d = {'a': 1, 'b': 2}
s = json.dumps(d, sort_keys=True)  # sort keys so equal dicts serialize identically

Restore the dict when you need it:

d2 = json.loads(s)
-2

If you don't put numbers in the dictionary and you never lose the variables containing your dictionaries, you can do this:

cache[id(rule)] = "whatever"

since id() is unique for every dictionary

EDIT:

Oh sorry, yeah in that case what the other guys said would be better. I think you could also serialize your dictionaries as a string, like

cache[ 'foo:bar' ] = 'baz'

If you need to recover your dictionaries from the keys though, then you'd have to do something uglier like

cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )

I guess the advantage of this is that you wouldn't have to write as much code.

  • Hmmm, no; this isn't what I'm looking for: `cache[id({'foo':'bar'})] = 'baz'; id({'foo':'bar'}) not in cache`. Being able to dynamically create keys is of importance for when I want to use dicts as keys in the first place. – SingleNegationElimination Apr 13 '12 at 00:45
  • Serializing the dicts might be ok, do you have a recommendation on a way to serialize them? That's what I'm looking for. – SingleNegationElimination May 21 '12 at 21:57