1

I would like to remove duplicate dicts from a list.

Specifically, if two dicts have the same value under the key paper_title, keep one and remove the other.

For example, given the list below

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]

It should return

return_value = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]

According to the tutorial, this can be achieved using a list comprehension or frozenset, like so:

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]


return_value= [i for n, i in enumerate(test_list) if i not in test_list[n + 1:]]

However, it removes no duplicates and returns the list unchanged:

return_value = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
                 {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
                 {"paper_title": 'Unique One', 'Paper_year': 3}, \
                 {"paper_title": 'Unique two', 'Paper_year': 3}]

May I know which part of the code I should change?

Also, is there a faster way to achieve a similar result?

ShadowRanger
mpx
  • 2
    Your second `dict` isn't a duplicate since the `'Paper_year'` value differs (if it was the same, your code from the tutorial would work). Do you want the concept of duplicate to be based solely on `"paper_title"`, keeping the first unique value each time? – ShadowRanger Jul 08 '20 at 03:49
  • Thanks for the prompt reply. Yes, I want to find the duplicate based on the key "paper_title" – mpx Jul 08 '20 at 03:53

5 Answers

2

This is because your sample dicts are all strictly different. If you change Paper_year to the same value, it works as expected:

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 3},  # Change 2 to 3
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]

[i for n, i in enumerate(test_list) if i not in test_list[n + 1:]]
#[{'Paper_year': 3, 'paper_title': 'This is duplicate'},
# {'Paper_year': 3, 'paper_title': 'Unique One'},
# {'Paper_year': 3, 'paper_title': 'Unique two'}]

One way to achieve the expected output using itertools.groupby:

from itertools import groupby

f = lambda x: x["paper_title"]
[next(g) for k, g in groupby(sorted(test_list, key=f),key=f)]

Output:

[{'Paper_year': 2, 'paper_title': 'This is duplicate'},
 {'Paper_year': 3, 'paper_title': 'Unique One'},
 {'Paper_year': 3, 'paper_title': 'Unique two'}]
Chris
  • Thanks for the suggestion. Out of curiosity, what advantage does this approach have over the suggestion made by @Kuldeep, which still uses the list comprehension approach? – mpx Jul 08 '20 at 03:58
  • 1
    @balandongiv It's a lot faster. Try with `test_list2 = test_list*10000`. In my environment, `groupby` is about 100x faster with a large list of dicts – Chris Jul 08 '20 at 04:03
  • 1
    Thanks for the explanation @Chris, appreciate it. For future readers, this should explain why groupby is faster: https://docs.python.org/3/library/itertools.html – mpx Jul 08 '20 at 04:06
  • Is `sorted(...)` necessary ? – Philippe Jul 08 '20 at 05:53
  • @Philippe `itertools.groupby` does not sort its input automatically. For example, `groupby(["a", "b", "a"])` will not yield `[["a", "a"], ["b"]]`, but rather `[["a"], ["b"], ["a"]]` – Chris Jul 08 '20 at 05:56
  • Thanks for the explanation! That makes sense. Then why doesn't `itertools.groupby` provide a second parameter `sort` that sorts the input when it's `true`? – Philippe Jul 08 '20 at 06:08
  • @Philippe That would not be ideal when `groupby` and `sorted` need different `key` functions ;) – Chris Jul 08 '20 at 06:10
  • Hi @Chris, while the code works like a charm on this mock dict, I am having difficulty making it work on my actual setup. I'd appreciate it if you could drop by and share your insight on this issue, which is accessible via the link: https://stackoverflow.com/q/62793097/6446053 – mpx Jul 08 '20 at 10:50
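
The point made in the comments above, that `groupby` only groups consecutive equal items, can be checked with a short sketch. The variable names `letters`, `unsorted_groups` and `sorted_groups` are made up for illustration; only `itertools.groupby` and `sorted` come from the thread.

```python
from itertools import groupby

letters = ["a", "b", "a"]

# Without sorting, groupby only merges adjacent equal items,
# so the two "a" entries end up in separate groups.
unsorted_groups = [(key, list(group)) for key, group in groupby(letters)]

# Sorting first makes equal items adjacent, giving one group per value.
sorted_groups = [(key, list(group)) for key, group in groupby(sorted(letters))]

print(unsorted_groups)  # [('a', ['a']), ('b', ['b']), ('a', ['a'])]
print(sorted_groups)    # [('a', ['a', 'a']), ('b', ['b'])]
```

Because `sorted` is stable, the first dict for each title in the original list is still the first one in its group, which is why `next(g)` keeps the earliest entry.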
1
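unique = []       # dicts kept, in original order
seen_titles = []  # paper_title values already encountered
for i in test_list:
    title = i["paper_title"]
    if title not in seen_titles:
        # First time this title appears: keep the dict.
        unique.append(i)
        seen_titles.append(title)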

This simple code can be used

1

In your code you are comparing whole dicts to find duplicates; what you want is to compare only the value stored under one key.

test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]
def check_presence(l, v):  # list, value
    # Return True if any dict in l has paper_title equal to v.
    for i in l:
        if i['paper_title'] == v:
            return True
    return False

return_value = [i for n, i in enumerate(test_list) if not check_presence(test_list[:n], test_list[n]['paper_title'])]
print(return_value)
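
The comprehension above rescans the earlier part of the list for every element, so it may slow down on long lists. As a possible alternative, here is a minimal sketch that makes a single pass with a dict. It assumes Python 3.7+ (dicts keep insertion order) and that the values under the key are hashable; `dedupe_by_key`, `sample` and `deduped` are hypothetical names, not part of the original answer.

```python
def dedupe_by_key(records, key):
    """Keep the first dict seen for each distinct value of `key`."""
    first_seen = {}
    for record in records:
        # setdefault stores the record only if this key value is new.
        first_seen.setdefault(record[key], record)
    return list(first_seen.values())


sample = [
    {"paper_title": "This is duplicate", "Paper_year": 2},
    {"paper_title": "This is duplicate", "Paper_year": 3},
    {"paper_title": "Unique One", "Paper_year": 3},
    {"paper_title": "Unique two", "Paper_year": 3},
]

deduped = dedupe_by_key(sample, "paper_title")
print(deduped)
```

Unlike the `sorted` plus `groupby` route, this keeps the surviving dicts in their original list order.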
Kuldeep Singh Sidhu
1

So unlike the tutorial you are following, you are trying to find unique entries based upon a single key in a dictionary rather than unique entries across all the key values.

The condition you've added for constructing the list in the comprehension is: i not in test_list[n+1:]

Which basically is the same as checking to see if i is equal to any of the entries in the list from position n+1 to the end of the list.

Since {"paper_title": 'This is duplicate', 'Paper_year': 2} != {"paper_title": 'This is duplicate', 'Paper_year': 3} you end up with both results in the list that you construct.

This is unlike the tutorial in which {'Akshat': 3} == {'Akshat': 3} so the second result is excluded.

Others have already responded with solutions that utilize the key, but I already typed this far so I hope this explanation adds a little more context to why it wasn't working.
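
To make the difference concrete, here is a small sketch of the two comparisons described above. The names `first`, `second`, `whole_dicts_equal` and `titles_equal` are only for illustration.

```python
first = {"paper_title": "This is duplicate", "Paper_year": 2}
second = {"paper_title": "This is duplicate", "Paper_year": 3}

# Dict equality compares every key/value pair, so these two differ.
whole_dicts_equal = first == second

# Comparing only the key of interest treats them as duplicates.
titles_equal = first["paper_title"] == second["paper_title"]

print(whole_dicts_equal)  # False
print(titles_equal)       # True
```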

Cameron Cairns
0

As the other answers point out, there are no pure duplicates. IMHO the simplest way to implement your requirement is to use pandas:

import pandas as pd
test_list = [{"paper_title": 'This is duplicate', 'Paper_year': 2}, \
             {"paper_title": 'This is duplicate', 'Paper_year': 3}, \
             {"paper_title": 'Unique One', 'Paper_year': 3}, \
             {"paper_title": 'Unique two', 'Paper_year': 3}]
test_list = pd.DataFrame(test_list).groupby("paper_title").first().reset_index().to_dict(orient="records")
test_list

output

[{'paper_title': 'This is duplicate', 'Paper_year': 2},
 {'paper_title': 'Unique One', 'Paper_year': 3},
 {'paper_title': 'Unique two', 'Paper_year': 3}]
Rob Raymond