Remove duplicates from a list of dictionaries when only one of the key values is different

Question

I have seen some similar answers, but I can't find something specific for this case:

I have a list of dictionaries like this:

[
 {"element":Bla, "version":2, "date":"12/04/12"},
 {"element":Bla, "version":2, "date":"12/05/12"},
 {"element":Bla, "version":3, "date":"12/04/12"}
]

The actual dictionary has many other keys, but what I'm trying to do is to delete all the entries that have the exact same key pair values, except for the date. That is, to delete all duplicates (which are not really exact duplicates as only the date varies). In this case, what I would expect to get is:

[
 {"element":Bla, "version":2, "date":"12/04/12"},
 {"element":Bla, "version":3, "date":"12/04/12"}
]

Thanks for all the answers, I was a bit stuck. Another question though, how would you do it to only keep the latest date, given that the date was a date time object? — user443850, Mar 19 '13 at 23:12
If you don't need the dictionaries in order they are in the initial list, you can do it with nearly the same algorithm, but instead of `result=[]` you will have `result={}` containing pairs `h: d` and you will need to update `result` of `h` is missing or new `d` has newer date. You won't need `seen` here, as its role will be served by `result`. Though you will loose the order. If you need order, you will have to make 2 pass. The first one to gather the newest dates, and the second to build the ordered list. — ovgolovin, Mar 19 '13 at 23:26
I used the same solution but just sorted the list before running the duplicate finder snippet. — user443850, Mar 19 '13 at 23:44

score 6 · Accepted Answer · edited Nov 19 '15 at 11:21

6

You say you have a lot of other keys in the dictionary not mentioned in the question.

Here is O(n) algorithm to do what you need:

>>> seen = set()
>>> result = []
>>> for d in dicts:
...     h = d.copy()
...     h.pop('date')
...     h = tuple(h.items())
...     if h not in seen:
...         result.append(d)
...         seen.add(h)

>>> pprint(result)
[{'date': '12/04/12', 'element': 'Bla', 'version': 2},
 {'date': '12/04/12', 'element': 'Bla', 'version': 3}]

h is a copy of the dict. date key is removed from it with pop.

Then tuple is created as a hashable type which can be added to set.

If h has never been seen before, we append it to result and add to seen. Additions to seen is O(1) as well as lookups (h not in seen).

At the end, result contains only unique elements in terms of defined h values.

edited Nov 19 '15 at 11:21

Mojtaba Yousefi

626
1
9
27

answered Mar 19 '13 at 22:56

ovgolovin

13,063
6
47
78

I guess you should sort the `h.items()` – thkang Mar 19 '13 at 23:12
Sorry, in you example, seen is a list? Doesn't let me add a tuple to a list, but if append it works. – user443850 Mar 19 '13 at 23:13
@thkang Not necessarily. `h.items()` will return the same sequences for the same `h`s. – ovgolovin Mar 19 '13 at 23:14
@user443850 Sorry, I forgot to copy all the initialization from the IDLE. Now I edited the answer. – ovgolovin Mar 19 '13 at 23:16
I now get: TypeError, unhashable type: "list", on if h not in seen – user443850 Mar 19 '13 at 23:22
@user443850 Do you have any other data apart from strings in the initial dictionary? – ovgolovin Mar 19 '13 at 23:27
@user443850 Is `Bla` an object? – ovgolovin Mar 19 '13 at 23:32
Yes, integers and datetime objects. However, setting seen as a list works as well. There are also tuples and other types. – user443850 Mar 19 '13 at 23:32
@user443850 Something inside the dict hinders hashing. And `set` is worked with hashable data. – ovgolovin Mar 19 '13 at 23:34
@user443850 `int` are perfectly hashable. `datetime` are also hashable (I have just checked). You can check it yourself with `hash(x)` (use whatever you want to check in the place of `x`) – ovgolovin Mar 19 '13 at 23:36
@user443850 If you set `seen` as list, you will get `O(n^2)` algorithm, which is much slower, as list is searched through from beginning to the end each time `in seen` is used. If `seen` is `set`, `in seen` is constant time operation `O(1)`. – ovgolovin Mar 19 '13 at 23:38
I think I can live with the performance hit, as it is not a huge data set. Thanks for your help again.I do have tuples and lists in the dictionary. – user443850 Mar 19 '13 at 23:43
@user443850 Your are welcome! If you get sometimes concerned with the performance, just make an analysis what hinders you from hashing dicts (with `hash(d)`), and then just convert it to hashable structure, as `h` is just to remove repetitions, and it should correspond to `d` (just to determine `d`s equivalence) but not to be exactly the same. – ovgolovin Mar 19 '13 at 23:46
A tuple is only hashable if all the elements contained in it are also hashable. – dawg Mar 19 '13 at 23:46
@drewk Yes! From the comments I can't see which is unhashable (int, stirng, datetime are all hashable). – ovgolovin Mar 19 '13 at 23:48
I just found a problem with this method. h.items can returns different list orders each time, so the loop does not detect the tuple as a duplicate. Any ideas? – user443850 Apr 11 '13 at 01:58
@user443850 Please, give an example of this not-expected behaviour. – ovgolovin Apr 11 '13 at 07:04
@user443850 [This answer]( http://stackoverflow.com/questions/835092/python-dictionary-are-keys-and-values-always-the-same-order) claims that non-modified dict should iterate items in the same order. Still, you lists are produced from different dicts, and though the resultant dicts may be effectively the same, the corresponding lists may have different order. So, just sort the items before adding to `tuple`. Try this: `tuple(sorted(h.items()))` – ovgolovin Apr 11 '13 at 07:09
What if you want the latest date only for each duplicated item? – Lee Sai Mun Mar 18 '22 at 11:00

johnsyweb · Answer 2 · 2013-03-19T23:06:48.640

You could use the "unique_everseen" recipe from itertools to create a new list.

list(unique_everseen(original_list, key=lambda e: '{element}@{version}'.format(**e)))

If your "key" needs to be wider than the lambda I have written (to accomodate more values), then it's probably worth extracting to a function:

def key_without_date(element):
    return '@'.join(["{}".format(v) for k,v in element.iteritems() if k != 'date'])

list(unique_everseen(original_list, key=key_without_date))

score 1 · Answer 3 · answered Mar 19 '13 at 22:50

Apologies for terrible variable names. There is probably a cleaner way but this should work

seen = {(item["element"], item["version"]): False for item in mylist}

output = []
for item in mylist:
    item_key = (item["element"], item["version"])
    if not seen[item_key]:
        output.append(item)
        seen[item_key] = True

score 1 · Answer 4 · answered Mar 19 '13 at 22:56

Pandas can solve this quickly:

import pandas as pd
Bla = "Bla"
d = [
{"element":Bla, "version":2, "date":"12/04/12"},
{"element":Bla, "version":2, "date":"12/05/12"},
{"element":Bla, "version":3, "date":"12/04/12"}
]
df = pd.DataFrame(d)
df[~df.drop("date", axis=1).duplicated()]

output:

       date element  version
0  12/04/12     Bla        2
2  12/04/12     Bla        3

dawg · Answer 5 · 2013-03-19T23:44:11.653

This works:

LoD=[
{"element":'Bla', "version":2, 'list':[1,2,3], "date":"12/04/12"},
{"element":'Bla', "version":2, 'list':[1,2,3], "date":"12/05/12"},
{"element":'Bla', "version":3, 'list':[1,2,3], "date":"12/04/12"}
]

LoDcopy=[]
seen=set()


for d in LoD:
    dc=d.copy()
    del dc['date']
    s=dc.__str__()
    if s in seen: continue
    seen.add(s)
    LoDcopy.append(d)    

print LoDcopy

prints:

[{'date': '12/04/12', 'version': 2, 'list': [1, 2, 3], 'element': 'Bla'}, 
 {'date': '12/04/12', 'version': 3, 'list': [1, 2, 3], 'element': 'Bla'}]

Remove duplicates from a list of dictionaries when only one of the key values is different

5 Answers5

Linked

Related