Get only first duplicates in list of dicts with python

Question

I have a list of dicts like this (it could have up to 12000 entries, though):

[
{'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'}, 
{'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
{'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'}
]

The first entries are the newest. I want to delete duplicates with the same title but keep the oldest entry for each title.

why a list of dicts? Why not one big dictionary with the title as keys and dates as values? then it inherently could not have any duplicates. — Tadhg McDonald-Jensen, May 13 '16 at 20:29
I haven't used python before and have to scrape data from a website. I just took one approach with list of dicts by chance. So no specific reason for myself — Sannin, May 13 '16 at 21:05

Tadhg McDonald-Jensen · Answer 1 · 2016-05-13T20:49:55.290

If you want to keep the list in its current format, you can keep a set of the titles seen so far and walk through the list backwards, either deleting an entry (its title was already seen) or adding its title to seen:

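def r_enumerate(iterable):
    """Yield (index, item) pairs from the last item to the first."""
    # Use itertools.izip and xrange if you are using Python 2!
    return zip(reversed(range(len(iterable))),
               reversed(iterable))

seen = set()  # titles encountered so far
for i, subdata in r_enumerate(data):
    if subdata['title'] in seen:
        # An older entry with this title was already kept, so drop this one.
        del data[i]
    else:
        seen.add(subdata['title'])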

This won't change the order of the remaining data. Traversing the list backwards means that the later (older) entries are kept, and because a deletion only shifts entries you have already visited, it cannot disturb the rest of the iteration.
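
If the list does not have to be modified in place, the same backwards traversal can build a new list instead. This is only a sketch: the helper name `keep_oldest` is made up here, and it assumes, as in the question, that the list is sorted newest first.

```python
import datetime


def keep_oldest(entries):
    """Return a new list keeping, for each title, the entry nearest the end.

    Hypothetical helper (not from the original answer). Assumes `entries`
    is sorted newest first, so the last occurrence of a title is the oldest.
    """
    seen = set()
    kept = []
    for entry in reversed(entries):
        if entry['title'] not in seen:
            seen.add(entry['title'])
            kept.append(entry)
    kept.reverse()  # restore the original relative order
    return kept


data = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]
result = keep_oldest(data)
```

Appending to a new list avoids the repeated `del`, each of which has to shift the tail of the list. For 12000 entries either version should be fast enough.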

On the other hand, if you are willing to use a single dictionary to store all the entries instead of a list of little dictionaries, this is really, really easy:

{partdict['title']: partdict['date'] for partdict in LIST_OF_DICTS}

When this is evaluated, entries that come later in the list override earlier ones with the same title, so only the oldest entry for each title is kept. You can then also index the entries by their title instead of their place in the list.

To get back to the list format (containing only the oldest entry for each title) you can do something like:

[{'title': title, 'date': date} for title, date in DICT_FORM.items()]

This can change the order, though (plain dicts did not guarantee any ordering before Python 3.7), and it is more work than the first approach if you want to keep the list format anyway.
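
Put together, the round trip through a dictionary might look like the sketch below. The names `entries`, `by_title`, and `deduped` are placeholders chosen here, and the list is assumed to be sorted newest first, as in the question.

```python
import datetime

entries = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]

# Later (older) entries overwrite earlier ones that share a title.
by_title = {e['title']: e['date'] for e in entries}

# .items() yields (title, date) pairs; iterating the dict alone yields only keys.
deduped = [{'title': title, 'date': date} for title, date in by_title.items()]
```

Note that this relies entirely on the input order: if the list were not sorted newest first, the last entry seen for a title would not necessarily be the oldest.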

Thank you for your help. I have already used jDo's solution because it seemed the easiest to adopt in my code. The order of the data is not important for me. I just thought it would be easier if it is known that the last(or the first with reversed list) title is the one to keep. The list is already sorted when i get the data. — Sannin, May 13 '16 at 21:08

jDo · Accepted Answer · 2016-05-13T21:12:43.160

I think this does what you want but I'm also using a dictionary rather than a list. It seems better suited to this type of data:

import datetime

dict_list = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'}
]

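# Collect the distinct titles.
dict_keys = set(map(lambda x: x["title"], dict_list))

# For each title, take the earliest date among the entries with that title.
earliest_entries = {k: min(x["date"] for x in dict_list if x["title"] == k) for k in dict_keys}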

Output:

>>> earliest_entries
{'Entry': datetime.datetime(2016, 1, 1, 0, 0), 'Something': datetime.datetime(2016, 1, 11, 0, 0)}
>>>
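
The comprehension above rescans the whole list once for every distinct title. For a list with up to 12000 entries, a single pass that tracks the smallest date per title should give the same result. This is a sketch of an alternative, not part of the original answer, and `earliest_by_title` is a name invented here.

```python
import datetime


def earliest_by_title(dict_list):
    """Map each title to its earliest date in one pass (hypothetical helper)."""
    earliest = {}
    for entry in dict_list:
        title, date = entry["title"], entry["date"]
        # Keep the date if the title is new or this date is earlier.
        if title not in earliest or date < earliest[title]:
            earliest[title] = date
    return earliest


dict_list = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]
earliest_entries = earliest_by_title(dict_list)
```

Because it compares dates directly, this version does not depend on the list being sorted.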
