How to delete duplicate entries in a nested container

Question

I've got a data structure like this:

[{'remote': '1', 'quantity': 1.0, 'timestamp': 1}, 
{'remote': '2', 'quantity': 1.0, 'timestamp': 2},
{'remote': '2', 'quantity': 1.0, 'timestamp': 3}, ...]

It is a list of dictionaries. My task is to find duplicate entries with respect to the remote value. If I find entries with the same remote value, I want to delete all of them except the one with the newest timestamp value.

In this example I have to find and delete the second dictionary, because the third one has the same remote but a newer timestamp value.

I am not that familiar with Python. I've googled a lot and only found solutions for plain lists, like this one:

How can I count the occurrences of a list item in Python?

My problem is that I can't work out how to apply this to my case. Furthermore, the solution should be reasonably efficient, because it has to run permanently in a background job with rather low computing power.

Thank you for your help!

Can you add your expected output and the code you have tried so far? — Mazdak, May 14 '15 at 10:22
are the keys in every dict and are the dicts in sorted order? — Padraic Cunningham, May 14 '15 at 11:12

BurningKarl · Answer 1 · 2015-05-14T10:48:32.730

1

If you have this:

data = [{"remote":1, "quantity":1.0, "timestamp":1},
        {"remote":2, "quantity":1.0, "timestamp":2},
        {"remote":2, "quantity":1.0, "timestamp":3}]

You can filter the entries like this:

filtered_data = []
for d1 in sorted(data, key=lambda e: e["timestamp"], reverse=True):
    for d2 in filtered_data:
        if d1["remote"] == d2["remote"]:
            break
    else:
        filtered_data.append(d1)
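
The inner loop rescans filtered_data for every entry, so the work grows with the number of kept entries each time. A minimal sketch of the same idea that remembers the remote values already kept in a set instead (the names keep_newest and seen are made up for illustration and are not part of the code above):

```python
def keep_newest(data):
    """Return one dict per "remote" value, keeping the highest "timestamp"."""
    seen = set()      # remote values that already have an entry in the result
    filtered = []
    # Newest first, so the first dict met for each remote is the one to keep.
    for d in sorted(data, key=lambda e: e["timestamp"], reverse=True):
        if d["remote"] not in seen:
            seen.add(d["remote"])
            filtered.append(d)
    return filtered


data = [{"remote": 1, "quantity": 1.0, "timestamp": 1},
        {"remote": 2, "quantity": 1.0, "timestamp": 2},
        {"remote": 2, "quantity": 1.0, "timestamp": 3}]
print(keep_newest(data))
```

This assumes the remote values are hashable (strings or numbers, as in the question). The result comes out newest first, just like the loop above.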

edited May 14 '15 at 10:48

answered May 14 '15 at 10:29


It works now, but I think the solution of Stefan Pochmann is much more efficient than mine – BurningKarl May 14 '15 at 10:49

score 1 · Accepted Answer · answered May 14 '15 at 10:44

Input:

entries = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]

Removal:

newest = {}
for entry in entries:
    current = newest.get(entry['remote'])
    if current is None or entry['timestamp'] > current['timestamp']:
        newest[entry['remote']] = entry
entries[:] = newest.values()

Output:

from pprint import pprint
pprint(entries)

Prints:
[{'quantity': 1.0, 'remote': '2', 'timestamp': 3},
 {'quantity': 1.0, 'remote': '1', 'timestamp': 1}]
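
If the job should not modify the list in place, the same loop could be wrapped in a function that returns a new list. This is only a sketch and the name dedupe_newest is hypothetical. On Python 3.7 and later, where dicts keep insertion order, the result follows the order in which each remote first appears, so it may differ from the order printed above.

```python
def dedupe_newest(entries):
    """Return a new list holding only the newest dict per 'remote' value."""
    newest = {}
    for entry in entries:
        current = newest.get(entry['remote'])
        # Keep the entry if this remote is new or its timestamp is higher.
        if current is None or entry['timestamp'] > current['timestamp']:
            newest[entry['remote']] = entry
    return list(newest.values())


entries = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]
result = dedupe_newest(entries)
print(result)
print(len(entries))  # the input list itself is left untouched
```

Each entry is looked at once and each dict lookup is constant time on average, which should suit a low-power background job.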

Padraic Cunningham · Answer 3 · 2015-05-14T11:52:40.513

If your dicts are in sorted order based on the 'remote' key, you can group them by the 'remote' key and take the last entry of each group, which will have the latest timestamp.

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
{'remote': '2', 'quantity': 1.0, 'timestamp': 2},
{'remote': '2', 'quantity': 1.0, 'timestamp': 3}]


from itertools import groupby
from operator import itemgetter

l[:] = (list(v)[-1] for _, v in groupby(l,key=(itemgetter("remote"))))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0},
 {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]

l[:] = ... changes the original list in place. The right-hand side, (list(v)[-1] for _, v in groupby(l, key=(itemgetter("remote")))), is a generator expression, so only one group at a time is turned into a list, which will help if memory is also an issue.
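
To see what changing the original list means in practice, here is a small sketch (the alias name is only for illustration). Slice assignment replaces the contents of the existing list object, so every other reference to that list sees the result, whereas a plain l = ... would only rebind the name l.

```python
from itertools import groupby
from operator import itemgetter

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]
alias = l  # a second reference to the same list object

# The generator is consumed first, then the list's contents are replaced.
l[:] = (list(v)[-1] for _, v in groupby(l, key=itemgetter("remote")))

print(alias is l)   # both names still point at the same object
print(len(alias))   # the alias sees the shortened list
```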

This will also work for unsorted data as long as the dupes are always together and the latest dupe comes last:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]

l[:] = (list(v)[-1] for k,v in groupby(l, key=(itemgetter("remote"))))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0}, {'timestamp': 1, 'remote': '4', 'quantity': 1.0}, {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]
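
The requirement that dupes sit next to each other matters because groupby only merges consecutive items with equal keys. A small sketch with made-up data (the list scattered is not from the question), where the two '2' entries are separated, shows that they land in separate groups, so both would survive:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: the two '2' entries are NOT adjacent.
scattered = [{'remote': '2', 'quantity': 1.0, 'timestamp': 2},
             {'remote': '1', 'quantity': 1.0, 'timestamp': 1},
             {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]

# Collect the key of every group that groupby produces.
keys = [k for k, _ in groupby(scattered, key=itemgetter("remote"))]
print(keys)  # the key '2' shows up as two separate groups
```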

Or, if the dupes are together but not sorted by timestamp, take the max based on timestamp:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]

l[:] = (max(v, key=itemgetter("timestamp")) for _, v in groupby(l, key=(itemgetter("remote"))))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0}, {'timestamp': 1, 'remote': '4', 'quantity': 1.0}, {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]

If you are going to sort, do an in-place reverse sort by the remote key and then the timestamp, then call next on each grouping v to get the latest. Sorting by remote alone would only keep the dupes in their original relative order, so the timestamp has to be part of the key:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]

l.sort(key=itemgetter("remote", "timestamp"), reverse=True)
l[:] = (next(v) for _, v in groupby(l, key=(itemgetter("remote"))))

print(l)

Sorting will change the order of the dicts, though, so that may not be suitable for your problem. If your dicts are already in order like your input, then you don't need to worry about sorting anyway.
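
If the original order does matter, one option that avoids sorting altogether is a two-pass filter. This is a different technique from the groupby approach, shown only as a sketch, and the function name newest_in_original_order is hypothetical:

```python
def newest_in_original_order(items):
    """Keep the newest dict per 'remote' without reordering the input."""
    latest = {}
    # Pass 1: remember the dict with the highest timestamp for each remote.
    for d in items:
        current = latest.get(d["remote"])
        if current is None or d["timestamp"] > current["timestamp"]:
            latest[d["remote"]] = d
    # Pass 2: keep only those dicts, in the order they appear in the input.
    return [d for d in items if latest[d["remote"]] is d]


l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 3},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]
print(newest_in_original_order(l))
```

With equal timestamps for the same remote, this keeps the first such dict it meets.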

@JoseRicardoBustosM. I thought both keys had to match to be considered a dupe but it is just the remote key. — Padraic Cunningham, May 14 '15 at 11:32
I appreciate your help. Stefan's solution is exactly what I was looking for. — BananaJoe, May 14 '15 at 14:18

score 0 · Answer 4 · answered May 14 '15 at 11:34

In [55]: from itertools import groupby

In [56]: from operator import itemgetter


In [58]: a
Out[58]: 
[{'quantity': 1.0, 'remote': '1', 'timestamp': 1},
 {'quantity': 1.0, 'remote': '2', 'timestamp': 2},
 {'quantity': 1.0, 'remote': '2', 'timestamp': 3}]

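Sort a by remote and then by timestamp. Since you need the latest (maximum) timestamp, reverse is True, so the newest dict comes first within each remote. Including remote in the sort key keeps equal remote values adjacent, which groupby needs:

In [58]: s_a=sorted(a,key=lambda x: (x['remote'], x['timestamp']),reverse = True)
In [59]: groups=[]
In [60]: for k,g in groupby(s_a,key=lambda x:x['remote']):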
    groups.append(list(g))
In [69]: [elem[0] for elem in groups]
Out[69]: 
[{'quantity': 1.0, 'remote': '2', 'timestamp': 3},
 {'quantity': 1.0, 'remote': '1', 'timestamp': 1}]
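
As a plain script rather than an IPython session, the steps above might look like the sketch below. It sorts by remote as well as timestamp, an assumption added here so that equal remote values are guaranteed to be adjacent before groupby sees them, and it takes the first item of each group directly instead of building the groups list.

```python
from itertools import groupby

a = [{'quantity': 1.0, 'remote': '1', 'timestamp': 1},
     {'quantity': 1.0, 'remote': '2', 'timestamp': 2},
     {'quantity': 1.0, 'remote': '2', 'timestamp': 3}]

# Descending by (remote, timestamp): dupes are adjacent, newest first.
s_a = sorted(a, key=lambda x: (x['remote'], x['timestamp']), reverse=True)

# The first item of every group is the newest dict for that remote.
result = [next(g) for _, g in groupby(s_a, key=lambda x: x['remote'])]
print(result)
```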
