2

I have a list of objects: with an id, a date and an indication of the type of object. for example

original_list = [{'id':1,'date':'2016-01-01','type':'A'},
                 {'id':2,'date':'2016-02-01','type':'B'},
                 {'id':3,'date':'2016-03-01','type':'A'},
                 {'id':1,'date':'2016-04-01','type':'C'}]

As shown above this list can contain duplicate id's and different dates, types. Now I want to create a list of unique id's which contains only the last entries (based on date). Now I have a procedure as followed:

# Create list of unique id's
unique_ids = list(set([foo.get('id') for foo in original_list]))

# find last contact
for unique_id in unique_ids:
    foo_same_id = [foo for foo in original_list if foo.get('id') == unique_id]
    if len(foo_same_id) == 1:
        # use this one
    else:
        latest_date = [foo.get('date') for foo in foo_same_id]
        latest_date = max(latest_date)
        latest_object = [foo for foo in foo_same_id if foo.get('date') == latest_date]

After this the list with the same id's is sorted on the date and is the last value of type used to fill in the type of the object. At that time I don't need these objects anymore and make a copy of the two lists (original_list and unique_ids) without the processed objects/ids.

This seems to work but when applied to 200.000 + it takes a lot of time (+ 4 hours). Are there ways to speed this up? Different implementations? Currently I'm reading in the data from a database and start processing immediately.

Mazdak
  • 105,000
  • 18
  • 159
  • 188
mtndoe
  • 424
  • 3
  • 7
  • 21
  • 1
    well obviously you should be sorting at the database level!!! – e4c5 Mar 12 '17 at 09:38
  • Would it be advisable to insert the data into a temp-table perform these operations on this table and then read only the relevant data? Performing these operations in the original query is currently impossible. – mtndoe Mar 12 '17 at 09:45
  • Check that please http://stackoverflow.com/questions/11032015/how-to-find-time-complexity-of-an-algorithm – Ari Gold Mar 12 '17 at 09:49
  • Stop everything and read about about WHERE and ORDER By – e4c5 Mar 12 '17 at 09:58
  • @M.Doe Could you share your DB table's structure or the query using which you're fetching this data? – Ashwini Chaudhary Mar 12 '17 at 10:00
  • @AshwiniChaudhary Sorry, I'm not able to share the structure... of the data I'm fetching. – mtndoe Mar 12 '17 at 10:04

2 Answers2

1

Instead of creating all unique ids using set and other extra operations, and then looping over the list and using all of those extra operations, you can simply use a custom dictionary in order to preserve the your dictionaries based on their ids. And due to the fact that dictionaries only keep the unique items if you override the __setitem__ method in a way that it only replaces the values based on their date (if it's greater than the current one) you'll simply create your desire list.

from datetime import datetime


class UniqueDict(dict):
    def __init__(self, *args, **kwds):
        super(UniqueDict, self).__init__(*args, **kwds)

    def __setitem__(self, _id, value):
        current = self.get(_id)
        if current:
            date_obj = datetime.strptime(value['date'], '%Y-%m-%d')
            current_date_obj = datetime.strptime(self[_id]['date'], '%Y-%m-%d')
            if date_obj > current_date_obj:
                dict.__setitem__(self, _id, value)
        else:
            dict.__setitem__(self, _id, value)

Demo:

original_list = [{'id':1,'date':'2016-01-01','type':'A'},
                 {'id':2,'date':'2016-02-01','type':'B'},
                 {'id':3,'date':'2016-03-01','type':'A'},
                 {'id':1,'date':'2016-04-01','type':'C'}]


udict = UniqueDict()

for d in original_list:
    udict[d['id']] = d

print(udict)

output:

{1: {'id': 1, 'date': '2016-04-01', 'type': 'C'},
 2: {'id': 2, 'date': '2016-02-01', 'type': 'B'},
 3: {'id': 3, 'date': '2016-03-01', 'type': 'A'}}

Note that as mentioned in comment, in this case you can also drop using datetime for converting your date strings to date objects for comparison ,since ISO formatted dates can be compared lexicographically.

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • You can remove the datetime conversion in this case as ISO 8601 format can be compared lexicographically. – Ashwini Chaudhary Mar 12 '17 at 10:11
  • @AshwiniChaudhary Indeed. Actually I used `datetime` because it's a general and most appropriate way which should be use for other date formats. – Mazdak Mar 12 '17 at 10:15
  • 1
    @Kasramvd this seems somewhat overkill if the data is already ordered, or can easily be ordered... `list({el['id']: el for el in original_list}.values())` if already ordered by date for instance – Jon Clements Mar 12 '17 at 10:25
  • @JonClements In that case yes, surely this will be overkill. – Mazdak Mar 12 '17 at 10:35
0

Dedup the original by using a custom function that only walks the list once and flattens it at the end:

def dedup_original(original):
    items = {}
    for item in original:
        if item['id'] in items:
            if items[item['id']]['date'] < item['date']:
                items[item['id']] = item
        else:
             items[item['id']] = item
    return list(items.values())

Result:

In [28]: dedup_original(original_list)
Out[28]:
[{'date': '2016-04-01', 'id': 1, 'type': 'C'},
 {'date': '2016-02-01', 'id': 2, 'type': 'B'},
 {'date': '2016-03-01', 'id': 3, 'type': 'A'}]
salparadise
  • 5,699
  • 1
  • 26
  • 32