1

I've got a data structure like this:

[{'remote': '1', 'quantity': 1.0, 'timestamp': 1}, 
{'remote': '2', 'quantity': 1.0, 'timestamp': 2},
{'remote': '2', 'quantity': 1.0, 'timestamp': 3}, ...]

It is a list of dictionaries. My task is to find duplicate entries with respect to the remote value. If I find entries with the same remote value, I want to delete all of them except the one with the newest timestamp value.

In this example, I would have to find and delete the second dictionary, because the third one has the same remote but a newer timestamp value.

I am not that familiar with Python. I've googled a lot and only found solutions for plain lists, like this one:

How can I count the occurrences of a list item in Python?

My problem is that I am not smart enough to apply this to my problem. Furthermore, the solution should be reasonably efficient, because it has to run permanently in a background job with rather low computing power.

Thank you for your help!

BananaJoe
  • 45
  • 7

4 Answers

1

If you have this:

data = [{"remote":1, "quantity":1.0, "timestamp":1},
        {"remote":2, "quantity":1.0, "timestamp":2},
        {"remote":2, "quantity":1.0, "timestamp":3}]

You can filter the entries like this:

filtered_data = []
for d1 in sorted(data, key=lambda e: e["timestamp"], reverse=True):
    for d2 in filtered_data:
        if d1["remote"] == d2["remote"]:
            break
    else:
        filtered_data.append(d1)
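
The inner loop above rescans filtered_data for every entry, so the work grows with the number of distinct remotes already kept. A possible variant, only a sketch and assuming the remote values are hashable (strings or ints, as in the question), tracks the remotes already kept in a set. The name seen is made up for this illustration.

```python
data = [{"remote": 1, "quantity": 1.0, "timestamp": 1},
        {"remote": 2, "quantity": 1.0, "timestamp": 2},
        {"remote": 2, "quantity": 1.0, "timestamp": 3}]

seen = set()          # remotes already kept (illustrative name)
filtered_data = []
# Newest first, so the first entry met for each remote is the one to keep.
for d in sorted(data, key=lambda e: e["timestamp"], reverse=True):
    if d["remote"] not in seen:
        seen.add(d["remote"])
        filtered_data.append(d)

print(filtered_data)
```

The membership test on a set takes roughly constant time, so the sort becomes the dominant cost.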
BurningKarl
  • 1,176
  • 9
  • 12
1

Input:

entries = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]

Removal:

newest = {}
for entry in entries:
    current = newest.get(entry['remote'])
    if current is None or entry['timestamp'] > current['timestamp']:
        newest[entry['remote']] = entry
entries[:] = newest.values()

Output:

from pprint import pprint
pprint(entries)

Prints:
[{'quantity': 1.0, 'remote': '2', 'timestamp': 3},
 {'quantity': 1.0, 'remote': '1', 'timestamp': 1}]
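
The same idea could be wrapped in a function that returns a new list instead of modifying entries in place. This is only a sketch, and the name keep_newest is made up here. On Python 3.7 and later, dicts preserve insertion order, so the result lists each remote in the order it first appeared, which can differ from the order printed above.

```python
def keep_newest(entries):
    """Return one entry per 'remote': the one with the largest 'timestamp'."""
    newest = {}
    for entry in entries:
        current = newest.get(entry['remote'])
        if current is None or entry['timestamp'] > current['timestamp']:
            newest[entry['remote']] = entry
    return list(newest.values())


entries = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]
result = keep_newest(entries)
print(result)
```

This makes a single pass over the list with one dict lookup per entry, and it does not require the input to be sorted.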
Stefan Pochmann
  • 27,593
  • 8
  • 44
  • 107
1

If your dicts are sorted by the 'remote' key, you can group them by that key and take the last entry of each group, which will have the latest timestamp as long as the timestamps within each group are ascending.

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
{'remote': '2', 'quantity': 1.0, 'timestamp': 2},
{'remote': '2', 'quantity': 1.0, 'timestamp': 3}]


from itertools import groupby
from operator import itemgetter

l[:] = (list(v)[-1] for _, v in groupby(l,key=(itemgetter("remote"))))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0},
 {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]

l[:] changes the original list in place. (list(v)[-1] for _, v in groupby(l, key=itemgetter("remote"))) is a generator expression, so the groups are consumed one at a time instead of all being stored in memory at once, which will help if memory is also an issue.
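
To illustrate the difference between slice assignment and plain assignment, here is a small sketch. The names values and alias are illustrative only. Slice assignment replaces the contents of the existing list object, so every other reference to that list sees the change, whereas plain assignment only rebinds the name.

```python
values = [1, 2, 2, 3]
alias = values                 # second reference to the same list object

# Slice assignment: the existing list object gets new contents.
values[:] = (x for x in values if x != 2)
print(alias)                   # the alias sees the filtered contents
print(alias is values)         # still the same object

# Plain assignment: the name now points to a brand-new list.
values = [x * 10 for x in values]
print(alias is values)         # no longer the same object
```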

This will also work for unsorted data as long as the dupes are always adjacent and the latest dupe comes last:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3}]

l[:] = (list(v)[-1] for k,v in groupby(l, key=(itemgetter("remote"))))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0}, {'timestamp': 1, 'remote': '4', 'quantity': 1.0}, {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]

Or, if the dupes are adjacent but not ordered by timestamp, take the max based on timestamp:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 3},
           {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]

l[:] = (max(v, key=itemgetter("timestamp")) for _, v in groupby(l, key=itemgetter("remote")))

print(l)
[{'timestamp': 1, 'remote': '1', 'quantity': 1.0}, {'timestamp': 1, 'remote': '4', 'quantity': 1.0}, {'timestamp': 3, 'remote': '2', 'quantity': 1.0}]
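
Note that groupby only merges consecutive items with the same key. The following sketch shows what happens when entries for the same remote are not adjacent. The sample data and the names unsorted_result, by_remote and sorted_result are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

l = [{'remote': '2', 'quantity': 1.0, 'timestamp': 3},
     {'remote': '1', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]

# Without sorting, remote '2' forms two separate groups and survives twice.
unsorted_result = [max(v, key=itemgetter("timestamp"))
                   for _, v in groupby(l, key=itemgetter("remote"))]
print(len(unsorted_result))

# Sorting by remote first makes equal remotes adjacent.
by_remote = sorted(l, key=itemgetter("remote"))
sorted_result = [max(v, key=itemgetter("timestamp"))
                 for _, v in groupby(by_remote, key=itemgetter("remote"))]
print(sorted_result)
```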

If you were going to sort, you should do an in-place reverse sort by the remote key and then the timestamp, then call next on the grouping v to get the latest:

l = [{'remote': '1', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '4', 'quantity': 1.0, 'timestamp': 1},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 3},
     {'remote': '2', 'quantity': 1.0, 'timestamp': 2}]

# Sort by (remote, timestamp) descending so the newest entry leads each group.
l.sort(key=itemgetter("remote", "timestamp"), reverse=True)
l[:] = (next(v) for _, v in groupby(l, key=itemgetter("remote")))

print(l)

Sorting will change the order of the dicts, though, so it may not be suitable for your problem. If your dicts are already in order like your input, you don't need to worry about sorting anyway.

Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
0
In [55]: from itertools import groupby

In [56]: from operator import itemgetter


In [58]: a
Out[58]: 
[{'quantity': 1.0, 'remote': '1', 'timestamp': 1},
 {'quantity': 1.0, 'remote': '2', 'timestamp': 2},
 {'quantity': 1.0, 'remote': '2', 'timestamp': 3}]

Sort a by remote and then by timestamp, so that entries with the same remote end up adjacent. Since you need the latest (maximum) timestamp first in each group, reverse is True:

In [58]: s_a = sorted(a, key=lambda x: (x['remote'], x['timestamp']), reverse=True)
In [59]: groups = []
In [60]: for k, g in groupby(s_a, key=lambda x: x['remote']):
    groups.append(list(g))
In [69]: [elem[0] for elem in groups]
Out[69]: 
[{'quantity': 1.0, 'remote': '2', 'timestamp': 3},
 {'quantity': 1.0, 'remote': '1', 'timestamp': 1}]
Ajay
  • 5,267
  • 2
  • 23
  • 30