Unable to remove duplicate dicts in list using list comprehension or frozenset

Question

I would like to remove duplicate dicts from a list.

Specifically, if two dicts have the same value under the key paper_title, keep one and remove the other.

For example, given the list below

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]

It should return

return_value = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]

According to the tutorial, this can be achieved using a list comprehension or frozenset, like so:

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]


return_value= [i for n, i in enumerate(test_list) if i not in test_list[n + 1:]]

However, it removes nothing and returns the list unchanged:

return_value = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
                 {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
                 {"paper_title": 'Unique One', 'Paper_year': 3}, \
                 {"paper_title": 'Unique two', 'Paper_year': 3}]

Which part of the code should I change?

Also, is there a faster way to achieve the same result?

Your second `dict` isn't a duplicate since the `'Paper_year'` value differs (if it was the same, your code from the tutorial would work). Do you want the concept of duplicate to be based solely on `"paper_title"`, keeping the first unique value each time? — ShadowRanger, Jul 08 '20 at 03:49
Thanks for the prompt reply. Yes, I want to find the duplicate based on the key "paper_title" — mpx, Jul 08 '20 at 03:53

score 2 · Accepted Answer · answered Jul 08 '20 at 03:53

This happens because your sample dicts are all strictly different. If you change Paper_year to the same value, it works as expected:

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 3},  # Changed 2 to 3
             {"paper_title": 'This is duplicate', 'Paper_year': 3},
             {"paper_title": 'Unique One', 'Paper_year': 3},
             {"paper_title": 'Unique two', 'Paper_year': 3}]

[i for n, i in enumerate(test_list) if i not in test_list[n + 1:]]
#[{'Paper_year': 3, 'paper_title': 'This is duplicate'},
# {'Paper_year': 3, 'paper_title': 'Unique One'},
# {'Paper_year': 3, 'paper_title': 'Unique two'}]

One way to achieve the expected output using itertools.groupby:

from itertools import groupby

f = lambda x: x["paper_title"]
[next(g) for k, g in groupby(sorted(test_list, key=f),key=f)]

Output:

[{'Paper_year': 2, 'paper_title': 'This is duplicate'},
 {'Paper_year': 3, 'paper_title': 'Unique One'},
 {'Paper_year': 3, 'paper_title': 'Unique two'}]
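
One thing worth noting about the groupby approach: because the list is sorted first, the result comes back ordered by title rather than in the original order. If input order matters, a single pass with a plain dict is one possible alternative. This is only a sketch and not part of the answer above; the helper name `dedupe_by_key` is made up for illustration, and the list is a reordered copy of the question's data so that the ordering difference is visible.

```python
def dedupe_by_key(records, key):
    """Keep the first dict seen for each distinct value of record[key]."""
    first_seen = {}
    for record in records:
        # setdefault stores the record only if this key value is new,
        # so later duplicates are ignored.
        first_seen.setdefault(record[key], record)
    # Dicts preserve insertion order (guaranteed since Python 3.7).
    return list(first_seen.values())


# Reordered copy of the question's data (hypothetical ordering).
test_list = [
    {"paper_title": "Unique two", "Paper_year": 3},
    {"paper_title": "This is duplicate", "Paper_year": 2},
    {"paper_title": "This is duplicate", "Paper_year": 3},
    {"paper_title": "Unique One", "Paper_year": 3},
]

result = dedupe_by_key(test_list, "paper_title")
print(result)
```

Here "Unique two" stays first, whereas sorting by title before grouping would move it to the end.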

answered Jul 08 '20 at 03:53

Chris


Thanks for the suggestion. Out of curiosity, what advantage does this approach have over the suggestion by @Kuldeep, which still uses the list comprehension approach? – mpx Jul 08 '20 at 03:58
@balandongiv It's a lot faster. Try with `test_list2 = test_list*10000`. In my environment, `groupby` is about 100x faster with a large list of dicts – Chris Jul 08 '20 at 04:03
Thanks for the explanation @Chris, appreciate it. For future readers, this should explain why groupby is faster: https://docs.python.org/3/library/itertools.html – mpx Jul 08 '20 at 04:06
Is `sorted(...)` necessary ? – Philippe Jul 08 '20 at 05:53
@Philippe `itertools.groupby` does not sort automatically. For example, `groupby(["a", "b", "a"])` will not yield `[["a", "a"], ["b"]]`, but rather `[["a"], ["b"], ["a"]]` – Chris Jul 08 '20 at 05:56
Thanks for the explanation! That makes sense. Then why does `itertools.groupby` not provide a second parameter `sort` that sorts the input when it is `True`? – Philippe Jul 08 '20 at 06:08
@Philippe That would not be ideal, since `groupby` and `sorted` may need different `key` functions ;) – Chris Jul 08 '20 at 06:10
Hi @Chris, while the code works like a charm on this mock dict, I am having difficulty making it work on my actual setup. I would appreciate it if you could drop by and share your insight on this issue, which is at: https://stackoverflow.com/q/62793097/6446053 – mpx Jul 08 '20 at 10:50
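
The point about `sorted(...)` raised in the comments can be checked directly. The following is a small illustrative sketch (the variable names are made up) showing that `groupby` only merges adjacent equal items:

```python
from itertools import groupby

letters = ["a", "b", "a"]

# Without sorting, the two "a" entries are not adjacent,
# so they end up in separate groups.
unsorted_groups = [list(group) for _, group in groupby(letters)]

# After sorting, equal items are adjacent and collapse into one group.
sorted_groups = [list(group) for _, group in groupby(sorted(letters))]

print(unsorted_groups)  # [['a'], ['b'], ['a']]
print(sorted_groups)    # [['a', 'a'], ['b']]
```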

score 1 · Answer 2 · 2020-07-08T03:53:12.240

j = []  # dicts to keep
z = []  # paper_title values already seen
for i in test_list:
    title = i["paper_title"]
    if title not in z:
        # First time this title appears: keep the dict and remember the title
        j.append(i)
        z.append(title)

This simple loop can be used; the deduplicated list ends up in j.

edited Jul 08 '20 at 03:53

answered Jul 08 '20 at 03:46

score 1 · Answer 3 · answered Jul 08 '20 at 03:52

Your code compares whole dicts for duplicates; what you want is to compare only the value of one key:

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]
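
def check_presence(l, v):  # list, value
    # Return True if any dict in l has paper_title equal to v
    for i in l:
        if i['paper_title'] == v:
            return True
    return False

# Keep a dict only if no earlier dict has the same paper_title
return_value = [i for n, i in enumerate(test_list)
                if not check_presence(test_list[:n], i['paper_title'])]
print(return_value)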

answered Jul 08 '20 at 03:52

Kuldeep Singh Sidhu


Thanks for the suggestion. But I need to compare yours with @Chris's suggestion in terms of speed and compactness before I can accept an answer. – mpx Jul 08 '20 at 04:00
Ok! Let me think, how this can be improved! – Kuldeep Singh Sidhu Jul 08 '20 at 04:03

score 1 · Answer 4 · answered Jul 08 '20 at 03:56

So unlike the tutorial you are following, you are trying to find unique entries based upon a single key in a dictionary rather than unique entries across all the key values.

The condition you've added for constructing the list in the comprehension is: i not in test_list[n+1:]

Which basically is the same as checking to see if i is equal to any of the entries in the list from position n+1 to the end of the list.

Since {"paper_title": 'This is duplicate', 'Paper_year': 2} != {"paper_title": 'This is duplicate', 'Paper_year': 3}, you end up with both results in the list that you construct.

This is unlike the tutorial in which {'Akshat': 3} == {'Akshat': 3} so the second result is excluded.

Others have already responded with solutions that utilize the key, but I already typed this far so I hope this explanation adds a little more context to why it wasn't working.
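
To make the comparison concrete, here is a minimal sketch using the two dicts from the question (the variable names are made up for illustration):

```python
same_title_a = {"paper_title": "This is duplicate", "Paper_year": 2}
same_title_b = {"paper_title": "This is duplicate", "Paper_year": 3}

# Dict equality compares every key/value pair, so a differing
# Paper_year is enough to make the dicts unequal.
whole_dict_equal = same_title_a == same_title_b

# `x in some_list` relies on that same equality check, so the first
# dict is not found in a list that holds only the second.
found_in_list = same_title_a in [same_title_b]

# Comparing only the key of interest gives the behaviour the question wants.
titles_equal = same_title_a["paper_title"] == same_title_b["paper_title"]

print(whole_dict_equal, found_in_list, titles_equal)  # False False True
```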

score 0 · Answer 5 · answered Jul 08 '20 at 04:23

As other answers note, there are no pure duplicates. The simplest way to implement your requirement is to use pandas, IMHO:

import pandas as pd
test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]
test_list = pd.DataFrame(test_list).groupby("paper_title").first().reset_index().to_dict(orient="records")
test_list

output

[{'paper_title': 'This is duplicate', 'Paper_year': 2},
 {'paper_title': 'Unique One', 'Paper_year': 3},
 {'paper_title': 'Unique two', 'Paper_year': 3}]
