0

Summary: I have a list of dictionaries, some of the elements being "duplicated". How can I define "duplicate" and how can I influence which element stays?

Detailed question:

There is a great answer to a stack overflow question on how to remove duplicates from a list of dictionaries:

l = [
    {'name': 'john', 'age': 20},
    {'name': 'john', 'age': 20},
    {'name': 'mark', 'age': 21},
    {'name': 'john', 'age': 99},
    {'name': 'john'}
]
print [dict(tupleized) for tupleized in set(tuple(item.items()) for item in l)]

I would like, however, to control the definition of "duplicate". For instance let's say that any of the dictionaries which has the same value for 'name' are "duplicates" to me. The expected output would be two elements only (one entry for 'john' and one for 'mark').

I would also like to control which of the "duplicates" is retained. For instance only the one with the highest 'age' for a given 'name'. The pruned down list would therefore be

[{'age': 99, 'name': 'john'}, {'age': 21, 'name': 'mark'}]

How can I do this in a clever and pythonic way? (my current idea is to go for a loop over the key 'name' and set a flag (on the age in the case above) to copy the most relevant entry into a new list -- but I was hoping for something more elegant)

Community
  • 1
  • 1
WoJ
  • 27,165
  • 48
  • 180
  • 345
  • 2
    `set`s and `dict`s mark duplicates by objects that have the same `hash` value. If you really want control over duplication, then you should wrap these `dict`s in your own class and define a `__hash__` for that class. – inspectorG4dget Oct 24 '13 at 08:55
  • when in doubt, iterate over the collection yourself and accumulate things as you need, using the tools that python gives you to make life easier. I recommend `defaultdict` and/or `groupby`. – roippi Oct 24 '13 at 09:17
  • @inspectorG4dget: would you have a pointer to some examples on how to do that? There are already very good answers below but the idea to build my own comparison is very appealing – WoJ Oct 24 '13 at 11:55
  • 1
    Have a look at [this](http://stackoverflow.com/q/14493204/198633) and [this](http://stackoverflow.com/q/13001913/198633) – inspectorG4dget Oct 24 '13 at 16:24

1 Answers1

1
  1. Sort the list by age.
  2. Construct a dictionary from it using name as key. (The choice of key "defines the duplicate".) In case of duplicate keys, the previous value is replaced. Because the list is sorted, the highest age prevails.
  3. Construct again a list of dictionaries.

.

>>> uniq = {d['name']: d.get('age') for d in sorted(l, key=lambda d : d.get('age'))}
>>> uniq
{'john': 99, 'mark': 21}
>>> [{'name': k, 'age': v} for k, v in uniq.items()]
[{'age': 99, 'name': 'john'}, {'age': 21, 'name': 'mark'}]
Janne Karila
  • 24,266
  • 6
  • 53
  • 94