Find duplicates in python list of dictionaries

Question

I have a list of dictionaries like the one below:

a = [{'un': 'a', 'id': "cd"}, {'un': 'b', 'id': "cd"},{'un': 'b', 'id':    "cd"}, {'un': 'c', 'id': "vd"},
    {'un': 'c', 'id': "a"}, {'un': 'c', 'id': "vd"}, {'un': 'a', 'id': "cm"}]

I need to find the dictionaries that are duplicates by the 'un' key. For example, {'un': 'a', 'id': "cd"} and {'un': 'a', 'id': "cm"} are duplicates because they share the same value for 'un'. Once the duplicates are found, I need to decide which dict to keep based on the value of its 'id' key; for example, we keep the dict whose 'id' is "cm".

I have already done the first step; see the code below:

from collections import defaultdict
temp_ids = []
dup_dict = defaultdict(list)
for number, row in enumerate(a):
    un = row['un']
    if un not in temp_ids:
        temp_ids.append(un)
    else:
        # record the index of every repeated 'un' value
        dup_dict[un].append(number)

Using this code I am more or less able to find the indexes of the duplicate dicts, but maybe there is a better method. I also need code for the next step, which decides which dict to keep and which to omit. I would be very grateful for help.

Do you need to use a list of dictionaries for this? A dataframe might be better suited to this kind of task — C_Z_, Jul 27 '16 at 19:48
Are you asking to find duplicates of the key itself or duplicates of the value at `['un']`? — Aaron, Jul 27 '16 at 19:49
I am getting this data as a list of dicts; maybe in the future I will try it. Thanks for the suggestion! — Yan, Jul 27 '16 at 19:53
`dict` is a built-in python command for creating a dictionary, so you probably want to avoid using it as a variable name. — Chris Mueller, Jul 27 '16 at 20:10

score 7 · Answer 1 · answered Dec 03 '21 at 21:26

Previous answers do not work well with a List where the Dictionaries have more than two items (i.e. they only retain up to two of the key-value pairs - what if one wants to keep all the key-value pairs, but remove the ones where a specific key is duplicated?)

To drop the items of a list of dicts in which one specific key is duplicated, you can do this:

import pandas as pd

# named `records` rather than `all`, which would shadow the built-in all()
records = [
    {"email":"art@art.com", "dn":"Art", "pid":11293849},
    {"email":"bob@bob.com", "dn":"Bob", "pid":12973129},
    {"email":"art@art.com", "dn":"Art", "pid":43975349},
    {"email":"sam@sam.com", "dn":"Sam", "pid":92379234},
]

df = pd.DataFrame(records)
# keep only the last row for each distinct 'email' value
df = df.drop_duplicates(subset=['email'], keep='last')
records = df.to_dict("records")
print(records)
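
If pandas is not available, the same "keep the last occurrence" behaviour can be approximated in plain Python. This is only a sketch: the helper name drop_duplicates_by_key is made up for illustration, and it assumes every dict contains the key and that the key's values are hashable.

```python
def drop_duplicates_by_key(records, key):
    """Keep the last dict seen for each value of `key`, preserving input order."""
    seen = set()
    kept = []
    # Walk backwards so the first hit for each key value is its last occurrence.
    for record in reversed(records):
        if record[key] not in seen:
            seen.add(record[key])
            kept.append(record)
    kept.reverse()  # restore the original relative order
    return kept


records = [
    {"email": "art@art.com", "dn": "Art", "pid": 11293849},
    {"email": "bob@bob.com", "dn": "Bob", "pid": 12973129},
    {"email": "art@art.com", "dn": "Art", "pid": 43975349},
    {"email": "sam@sam.com", "dn": "Sam", "pid": 92379234},
]

deduped = drop_duplicates_by_key(records, "email")
print(deduped)
```

All key-value pairs of the surviving dicts are retained, and only the earlier "art@art.com" entry is dropped.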

This should be the answer as it is the most valuable for other readers with similar problems. Made with compatibility in mind. — harmonica141, Mar 25 '22 at 16:22

Mazdak · Answer 2 · 2018-04-24T12:22:25.453

In general, if you want to find duplicates in a list of dictionaries, you should group the dictionaries so that duplicates end up in the same group. To do that, you need to group on the dict items. Since order does not matter when comparing dictionaries, you need a container that is both hashable and ignores the order of its elements. A frozenset() is a good choice for this task, provided the dictionary values are themselves hashable.

Example:

In [87]: lst = [{2: 4, 6: 0},{20: 41, 60: 88},{5: 10, 2: 4, 6: 0},{20: 41, 60: 88},{2: 4, 6: 0}]

In [88]: result = defaultdict(list)

In [89]: for i, d in enumerate(lst):
    ...:     result[frozenset(d.items())].append(i)
    ...:     
In [91]: result
Out[91]: 
defaultdict(list,
            {frozenset({(2, 4), (6, 0)}): [0, 4],
             frozenset({(20, 41), (60, 88)}): [1, 3],
             frozenset({(2, 4), (5, 10), (6, 0)}): [2]})
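
As a possible follow-up to the grouping above, the groups holding more than one index are the actual duplicates. The sketch below repeats the grouping as a plain script; the names groups and duplicate_indexes are illustrative, and it assumes all dictionary values are hashable.

```python
from collections import defaultdict

lst = [{2: 4, 6: 0}, {20: 41, 60: 88}, {5: 10, 2: 4, 6: 0},
       {20: 41, 60: 88}, {2: 4, 6: 0}]

groups = defaultdict(list)
for index, d in enumerate(lst):
    # frozenset(d.items()) raises TypeError if any value in d is unhashable
    groups[frozenset(d.items())].append(index)

# Keep only the groups that contain more than one index
duplicate_indexes = [indexes for indexes in groups.values() if len(indexes) > 1]
print(duplicate_indexes)
```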

In your case, you can group the dictionaries by the 'un' key and then choose the items to keep based on 'id':

>>> from collections import defaultdict
>>> 
>>> d = defaultdict(list)
>>> 
>>> for i in a:
...     d[i['un']].append(i)
... 
>>> d
defaultdict(<type 'list'>, {'a': [{'un': 'a', 'id': 'cd'}, {'un': 'a', 'id': 'cm'}], 'c': [{'un': 'c', 'id': 'vd'}, {'un': 'c', 'id': 'a'}, {'un': 'c', 'id': 'vd'}], 'b': [{'un': 'b', 'id': 'cd'}, {'un': 'b', 'id': 'cd'}]})
>>> 
>>> keeps = {'a': 'cm', 'b':'cd', 'c':'vd'} # maps each 'un' value to the 'id' that should be kept for it
>>> 
>>> [i for key, val in d.items() for i in val if i['id']==keeps[key]]
[{'un': 'a', 'id': 'cm'}, {'un': 'c', 'id': 'vd'}, {'un': 'c', 'id': 'vd'}, {'un': 'b', 'id': 'cd'}, {'un': 'b', 'id': 'cd'}]
>>>

In the last line (the nested list comprehension) we loop over the items of the aggregated dict, then over the dicts in each value list, and keep those that satisfy the condition i['id']==keeps[key]. In other words, we keep the dicts whose 'id' matches the value specified for their 'un' in the keeps dictionary.

You can break the list comprehension down into something like this:

final_list = []
for key, val in d.items():
    for i in val:
        if i['id']==keeps[key]:
            final_list.append(i)

Note that a list comprehension is usually somewhat faster than the equivalent for loop with append calls, and it is the more Pythonic way to go. But if performance is not important to you, you can use the regular loop.
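
Note that the filter keeps every matching dict, so exact repeats such as the two {'un': 'c', 'id': 'vd'} entries both survive. If exactly one dict per 'un' is wanted, something along these lines might work. It is a sketch only: falling back to the first dict in a group when no 'id' matches is an assumption, not something the question specifies.

```python
from collections import defaultdict

a = [{'un': 'a', 'id': "cd"}, {'un': 'b', 'id': "cd"}, {'un': 'b', 'id': "cd"},
     {'un': 'c', 'id': "vd"}, {'un': 'c', 'id': "a"}, {'un': 'c', 'id': "vd"},
     {'un': 'a', 'id': "cm"}]

# maps each 'un' value to the 'id' that should be kept for it
keeps = {'a': 'cm', 'b': 'cd', 'c': 'vd'}

groups = defaultdict(list)
for row in a:
    groups[row['un']].append(row)

result = []
for un, rows in groups.items():
    # Take the first row whose 'id' matches keeps[un];
    # assumed fallback: the first row of the group if nothing matches.
    preferred = next((row for row in rows if row['id'] == keeps.get(un)), rows[0])
    result.append(preferred)

print(result)
```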

Could you briefly explain your one-liner at the end? — Yan, Jul 27 '16 at 20:02

score 2 · Answer 3 · answered Jul 27 '16 at 20:01

you were pretty much on the right track with a defaultdict... here's roughly how I would write it.

from collections import defaultdict
a = [{'un': 'a', 'id': "cd"}, {'un': 'b', 'id': "cd"},{'un': 'b', 'id':    "cd"}, {'un': 'c', 'id': "vd"}, {'un': 'c', 'id': "a"}, {'un': 'c', 'id': "vd"}, {'un': 'a', 'id': "cm"}]

items = defaultdict(list)
for row in a:
    items[row['un']].append(row['id'])  # make a list of 'id' values for each 'un' key

for key in items.keys():
    if len(items[key]) > 1:  # if there is more than one 'id'
        newValue = somefunc(items[key])  # somefunc is a placeholder: decide which of the list items to keep
        items[key] = newValue  # put that new value back into the dictionary
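
For illustration, here is one way the placeholder somefunc could be filled in. The function pick_id and its preferred tuple are hypothetical; the question only says that "cm" should win, so the rest of the priority order is an assumption. Unlike the loop above, this sketch collapses every list, including single-item ones, so all values end up as plain strings.

```python
from collections import defaultdict

a = [{'un': 'a', 'id': "cd"}, {'un': 'b', 'id': "cd"}, {'un': 'b', 'id': "cd"},
     {'un': 'c', 'id': "vd"}, {'un': 'c', 'id': "a"}, {'un': 'c', 'id': "vd"},
     {'un': 'a', 'id': "cm"}]


def pick_id(ids, preferred=("cm", "vd")):
    """Hypothetical stand-in for somefunc: return the first preferred id
    present in ids, or the first id in the list if none is preferred."""
    for candidate in preferred:
        if candidate in ids:
            return candidate
    return ids[0]


items = defaultdict(list)
for row in a:
    items[row['un']].append(row['id'])  # list of 'id' values for each 'un' key

# Collapse each list of ids to a single chosen id
chosen = {un: pick_id(ids) for un, ids in items.items()}
print(chosen)
```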
