Python 2.6: removing and counting duplicates in a list of dictionaries efficiently

Question

I'm trying to efficiently change:

[{'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 2}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'haltlo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'hallo world', 'num': 1}]

into a list of dictionaries without duplicates and a count of duplicates:

[{'text': 'hallo world', 'num': 2, 'count':1}, 
 {'text': 'hallo world', 'num': 1, 'count':5}, 
 {'text': 'haltlo world', 'num': 1, 'count':1}]

So far, I have the following to remove duplicates:

result = [dict(tupleized) for tupleized in set(tuple(item.items()) for item in li)]

and it returns:

[{'text': 'hallo world', 'num': 2}, 
 {'text': 'hallo world', 'num': 1}, 
 {'text': 'haltlo world', 'num': 1}]

THANKS!

I'd suggest using `collections.Counter`, but the `dict` type is not hashable :(. If you could turn those dicts into dict-like objects with a hash function, `Counter` would work nicely here. – Diego Navarro Aug 09 '12 at 06:23
You can write your own algorithm based on `set`s: set('ABC') - set('ABC') == set([]) – Dmitry Zagorulkin Aug 09 '12 at 06:28
Thanks. I'm using Python 2.6 too. `Counter` is only available in v2.7+. – tr33hous Aug 09 '12 at 06:28
`tuple(item.items())` won't work properly: even if the dicts are equal, their `items()` are not always in the same order. – John La Rooy Aug 09 '12 at 06:36
@gnibbler If each dict has the same keys, won't it always be in the same order? – jamylak Aug 09 '12 at 06:49
@jamylak No, it depends on the order in which the keys are added. I posted an example on SO somewhere. IIRC just 5 keys added in a different order was enough. – John La Rooy Aug 09 '12 at 06:51
See [this](http://stackoverflow.com/a/9793956/566644) answer for an example. – Lauritz V. Thaulow Aug 09 '12 at 06:56
@tr33hous What type of data are you storing? Is it just strings and counts (ints) and other immutable types? – jamylak Aug 09 '12 at 07:17

score 6 · Accepted Answer · edited May 23 '17 at 10:24

I'll use one of my favourites from itertools:

from itertools import groupby

def canonicalize_dict(x):
    "Return a (key, value) list sorted by the hash of the key"
    return sorted(x.items(), key=lambda x: hash(x[0]))

def unique_and_count(lst):
    "Return a list of unique dicts with a 'count' key added"
    grouper = groupby(sorted(map(canonicalize_dict, lst)))
    return [dict(k + [("count", len(list(g)))]) for k, g in grouper]

a = [{'text': 'hallo world', 'num': 1},  
     #....
     {'text': 'hallo world', 'num': 1}]

print unique_and_count(a)

Output

[{'count': 5, 'text': 'hallo world', 'num': 1}, 
{'count': 1, 'text': 'hallo world', 'num': 2}, 
{'count': 1, 'text': 'haltlo world', 'num': 1}]

As gnibbler points out, d1.items() and d2.items() may have different key ordering even if the keys are identical, so I've introduced the canonicalize_dict function to address this concern.
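To illustrate the ordering concern, here is a minimal sketch. The dicts `d1` and `d2` and the helper name `canonical_items` are made up for this example. It assumes the keys can all be compared with each other, and it sorts by key rather than by the hash of the key as the function above does:

```python
# Two equal dicts built with different insertion orders (hypothetical data).
d1 = {}
d1['text'] = 'hallo world'
d1['num'] = 1

d2 = {}
d2['num'] = 1
d2['text'] = 'hallo world'


def canonical_items(d):
    """Return the (key, value) pairs of d as a tuple sorted by key.

    Assumes all keys of d can be compared with each other.
    """
    return tuple(sorted(d.items(), key=lambda kv: kv[0]))


# The dicts compare equal, but tuple(d.items()) is not guaranteed to match,
# because the iteration order of a dict can depend on its insertion history.
same_raw = tuple(d1.items()) == tuple(d2.items())

# The canonical form is identical for equal dicts regardless of that order.
same_canonical = canonical_items(d1) == canonical_items(d2)
```

Whether `same_raw` is true depends on the interpreter version and the keys involved, which is why the raw tuples should not be relied on; `same_canonical` holds for any two equal dicts with sortable keys.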

answered Aug 09 '12 at 06:26

Lauritz V. Thaulow

Sorting works if all the keys are sortable. It's possible to be hashable but not sortable, e.g. complex numbers. – John La Rooy Aug 09 '12 at 07:02
You have a tiny mistake in `unique_and_count` - it should be `for x in lst` and not `for x in a`. – Zaar Hai Aug 09 '12 at 07:05
@Zaar Thanks, missed that when I refactored the code earlier. – Lauritz V. Thaulow Aug 09 '12 at 07:06

jamylak · Answer 2 · 2012-08-09T07:14:36.317

6

Note: This now uses `frozenset`, which means that the keys and values in each dictionary must be hashable.

>>> from collections import defaultdict
>>> from itertools import chain
>>> data = [{'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 2},  {'text': 'hallo world', 'num': 1}, {'text': 'haltlo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}, {'text': 'hallo world', 'num': 1}]
>>> c = defaultdict(int)
>>> for d in data:
...     c[frozenset(d.iteritems())] += 1
...
>>> [dict(chain(k, (('count', count),))) for k, count in c.iteritems()]
[{'count': 1, 'text': 'haltlo world', 'num': 1}, {'count': 1, 'text': 'hallo world', 'num': 2}, {'count': 5, 'text': 'hallo world', 'num': 1}]
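The same idea can be wrapped in a reusable function. This is only a sketch: the name `count_duplicates` is made up here, it uses `items()` instead of `iteritems()` so that it also runs on Python 3, and it assumes every key and value is hashable and that no input dict already has a `'count'` key (it would be overwritten).

```python
from collections import defaultdict


def count_duplicates(dicts):
    """Return one dict per distinct input dict, with a 'count' key added.

    Assumes every key and value is hashable, because the items of each
    dict are stored in a frozenset that is used as a dictionary key.
    """
    counts = defaultdict(int)
    for d in dicts:
        # A frozenset ignores item order, so equal dicts share one key.
        counts[frozenset(d.items())] += 1

    result = []
    for items, n in counts.items():
        unique = dict(items)   # rebuild the original dict from its items
        unique['count'] = n    # overwrites any existing 'count' key
        result.append(unique)
    return result


data = [{'text': 'hallo world', 'num': 1},
        {'text': 'hallo world', 'num': 2},
        {'text': 'hallo world', 'num': 1},
        {'text': 'haltlo world', 'num': 1}]
result = count_duplicates(data)
```

The order of the dicts in `result` is not guaranteed, so sort the list afterwards if a stable order matters.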

edited Aug 09 '12 at 07:14

answered Aug 09 '12 at 06:28

jamylak

Awesome answer. The only caveat is that you need to know the names of the fields beforehand. Thanks a lot anyway. Will use the solution by @lazyr above. – tr33hous Aug 09 '12 at 06:32
@tr33hous You don't need to know them, I was just being explicit. I'll change it now. – jamylak Aug 09 '12 at 06:34
@tr33hous It doesn't need to know the fields now. Also note that this solution runs in O(N), while lazyr's solution uses a sort, which makes it O(N log N). If you are dealing with huge lists you will need to consider this. – jamylak Aug 09 '12 at 06:42
As gnibbler points out in the question comments, `d.iteritems()` isn't guaranteed to return the keys in the same order for all the dictionaries. – Lauritz V. Thaulow Aug 09 '12 at 06:53
Yeah, just seen the edits. Sorry I didn't take more time to look at the solution. +1 – tr33hous Aug 09 '12 at 06:54
@tr33hous I will delete this answer since it is inaccurate. Please unaccept it, or I can't delete it :D – jamylak Aug 09 '12 at 06:58
@jamylak Just borrow my `canonical_dict` function, and it'll work again. No need to delete. – Lauritz V. Thaulow Aug 09 '12 at 07:00
@jamylak I've made a small change to the function so that it handles unsortable keys (like complex numbers). – Lauritz V. Thaulow Aug 09 '12 at 07:08
@lazyr Changed to use `frozenset` instead. It seems fine, since the OP is just storing strings in this example. – jamylak Aug 09 '12 at 07:15
@jamylak Yes. But if the values are unhashable, it'll break. – Lauritz V. Thaulow Aug 09 '12 at 07:17
@lazyr Well, he didn't mention any mutables in the example. If there are any, the OP can unaccept this anyway. – jamylak Aug 09 '12 at 07:19
@jamylak Just removed it as the accepted answer so you can go ahead. Thanks!! – tr33hous Aug 09 '12 at 07:22
@tr33hous I changed it so it works fine as long as there are no unhashable values. I'll just leave it now since it works. – jamylak Aug 09 '12 at 07:55

score 1 · Answer 3 · answered Nov 22 '19 at 12:51

If you want a simple solution that needs no imports:

>>> d = [{'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 2}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'haltlo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}, 
...  {'text': 'hallo world', 'num': 1}]
>>> 
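>>> def unique_counter(filesets):
...     unique = []
...     for i in filesets:
...         if i not in unique:  # compare whole dicts, not just 'num'
...             unique.append(i)
...     # dict(u, count=...) copies u, so the input list is left unchanged
...     return [dict(u, count=sum(1 for j in filesets if j == u)) for u in unique]
... 
>>> unique_counter(d)
[{'count': 5, 'text': 'hallo world', 'num': 1}, {'count': 1, 'text': 'hallo world', 'num': 2}, {'count': 1, 'text': 'haltlo world', 'num': 1}]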
