I have a list of dicts, each with an id, a date, and a type. For example:
original_list = [{'id':1,'date':'2016-01-01','type':'A'},
{'id':2,'date':'2016-02-01','type':'B'},
{'id':3,'date':'2016-03-01','type':'A'},
{'id':1,'date':'2016-04-01','type':'C'}]
As shown above, this list can contain duplicate ids with different dates and types. I want to reduce it to a list of unique ids, keeping only the most recent entry (by date) for each id. My current procedure is as follows:
# Create a list of unique ids
unique_ids = list(set(foo.get('id') for foo in original_list))

# Find the latest entry for each id
for unique_id in unique_ids:
    foo_same_id = [foo for foo in original_list if foo.get('id') == unique_id]
    if len(foo_same_id) == 1:
        latest_object = foo_same_id[0]  # only one entry, use it
    else:
        latest_date = max(foo.get('date') for foo in foo_same_id)
        latest_object = [foo for foo in foo_same_id
                         if foo.get('date') == latest_date]
After this, the entries sharing the same id are sorted by date and the type of the last entry is used to fill in the type of the resulting object. At that point I no longer need these objects, so I make copies of the two lists (original_list and unique_ids) without the processed objects/ids.
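To make that step concrete, the "sort by date, take the last type" part I'm describing looks roughly like this (variable names are just illustrative):

```python
# Entries sharing the same id, e.g. for id 1
foo_same_id = [{'id': 1, 'date': '2016-01-01', 'type': 'A'},
               {'id': 1, 'date': '2016-04-01', 'type': 'C'}]

# ISO dates ('YYYY-MM-DD') sort correctly as plain strings
foo_same_id.sort(key=lambda foo: foo['date'])

# Use the type of the most recent entry
final_type = foo_same_id[-1]['type']
```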
This seems to work, but when applied to 200,000+ records it takes a long time (4+ hours). Are there ways to speed this up? Different implementations? Currently I read the data from a database and start processing immediately.
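One alternative I'm wondering about is replacing the per-id scan (which walks the whole list once per unique id) with a single pass that keeps a dict keyed by id, holding only the latest record seen so far. A sketch (function and variable names are my own):

```python
def latest_per_id(records):
    """Keep only the most recent record (by date) for each id,
    in a single pass over the input list."""
    latest = {}
    for rec in records:
        rec_id = rec['id']
        # ISO dates ('YYYY-MM-DD') compare correctly as strings
        if rec_id not in latest or rec['date'] > latest[rec_id]['date']:
            latest[rec_id] = rec
    return list(latest.values())

original_list = [{'id': 1, 'date': '2016-01-01', 'type': 'A'},
                 {'id': 2, 'date': '2016-02-01', 'type': 'B'},
                 {'id': 3, 'date': '2016-03-01', 'type': 'A'},
                 {'id': 1, 'date': '2016-04-01', 'type': 'C'}]

print(latest_per_id(original_list))
```

Would something like this scale better for 200,000+ records, or is there a more idiomatic approach?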