3

EDIT: as @BrenBarn pointed out, the original didn't make sense.

Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.

# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]

# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)

# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}

# iv. something else entirely

In the spirit of the Zen of Python what's the "obvious way to do it"?

Community
  • 1
  • 1
everial
  • 302
  • 1
  • 10
  • 3
    I'm not sure what you're asking. Are you looking for [a frozen dictionary](http://stackoverflow.com/questions/2703599/what-would-be-a-frozen-dict) (i.e., one that could be used as a key in another dict)? Your code examples don't make much sense (e.g., your first one just recreates the original `pile_o_dicts`). – BrenBarn Sep 05 '12 at 22:38
  • 2
    Not that this is the preferred solution or anything, but in your first example you can avoid the "round-trip" of eval'ing back the dict by just use a dict comprehension: `{repr(d):d for d in pile_o_dicts}.values()` instead of a set + a list comp with evals – jdi Sep 05 '12 at 22:51
  • @BrenBarn you're right, edited into something (hopefully!) sensible. – everial Sep 06 '12 at 00:05

3 Answers3

4

Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.

Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:

filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()

Or in Python 3, assuming you want a list rather than a dictionary view:

filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())

These are both bit clunky. For readability, consider breaking it into two lines:

dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()

The alternative is an ordered tuple of tuples:

filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()

And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:

>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
Community
  • 1
  • 1
senderle
  • 145,869
  • 36
  • 209
  • 233
3

The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:

 groups = {}
 for d in pile_o_dicts:
     k = tuple(sorted(d.items()))
     groups.setdefault(k, []).append(d)

This will group identical dictionaries together.

FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
2

If the dicts all have the same keys, you can use a namedtuple

>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
John La Rooy
  • 295,403
  • 53
  • 369
  • 502