Best way to delete duplicate values in a list of dictionaries?

Question

I have a list of dictionaries, and those dictionaries have another nested dictionary. Here is an example:

reports = [
            {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
            {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
            {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}}
          ]

So each dict in the list has a unique id and values associated with it.

I need a way to iterate through these dictionaries and if any of them have an exact match in the 'subject' field then I want to delete/remove the entire dict with the latest date.

So, using the example above, after iterating through the list and de-duping, I need the result to look like this.

reports = [
            {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
            {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/3/2017'}}
          ]

It deletes the first instance of 'dupe1' because it is the later date.

What have you tried and what precisely is the problem with it? — jonrsharpe, Apr 08 '17 at 19:23
I've seen a couple of examples that iterate through a list of dictionaries, but none with nested dictionaries like I have above. — bbennett36, Apr 08 '17 at 19:24
That's not what I asked; SO isn't a code-writing service, you're expected to put some effort into an actual implementation yourself. Additionally, please don't revert legitimate edits; ask **one question** at a time (preferably after reading How to Ask). — jonrsharpe, Apr 08 '17 at 19:25
If I knew the solution or knew how to go about it I wouldn't be posting here. From SO about us page: "With your help, we're working together to build a library of detailed answers to every question about programming.". I'm not looking for a code-writing service. I'm looking for an answer to my question. And I threw the bonus question in here because I'm guessing if it was sorted, it would be easier to do what I'm looking for — bbennett36, Apr 08 '17 at 19:29
And to answer your question would be writing your code for you, which isn't useful to anyone but you and therefore an inefficient use of people's time. There are plenty of existing answers about manipulating lists and dictionaries, which is a basic concept also covered in many existing tutorials; please do some proper research. — jonrsharpe, Apr 08 '17 at 19:31
Could you show me somewhere that has an answer related to the question above? — bbennett36, Apr 08 '17 at 19:32
If the second instance with `'dupe1'` was deleted, which has the latest `'due_date'` of `'4/5/2017'`, the results would have to contain the first one, namely: `{'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}}`, **not** what is shown in your question. — martineau, Apr 08 '17 at 20:59
@martineau that was a typo, sorry about that. The dates were different; I changed them so they weren't in order and forgot to change that part. — bbennett36, Apr 08 '17 at 21:25

Eric Duminil · Answer 1 · 2017-04-08T20:31:02.083

Since you're completely stuck, here's a start. One problem is that for each dict, the key is different and unknown. It looks like there's only one pair in each dict, so you can get items() and take the first one:

reports = [ 
    {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
    {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
    {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}}
]

def get_subject(some_dict):
    return list(some_dict.items())[0][1]['subject']

reports.sort(key=get_subject)
print(reports)
# [{'00T2A00003mDvq9': {'due_date': '4/5/2017', 'subject': 'dupe1'}}, {'00T2A00003mDvq7': {'due_date': '4/3/2017', 'subject': 'dupe1'}}, {'00T2A00003mDvq8': {'due_date': '4/7/2017', 'subject': 'dupe2'}}]

reports is now sorted by subject. You can then use itertools.groupby to get the reports grouped by subject.

For each group, you can use sort again, this time with due_date. Take care, though: you cannot sort these dates alphabetically. You'll need to extract year, month and day in that order, or convert the string to a datetime object with strptime.
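To illustrate why the dates need parsing, here is a small sketch comparing a plain string sort with a sort keyed on parsed dates. The sample dates and the DATE_FMT name are assumptions chosen for the demonstration, not part of the question's data.

```python
from datetime import datetime

DATE_FMT = '%m/%d/%Y'  # assumed format: month/day/year, as in the question

# Illustrative dates only (not from the original reports).
dates = ['4/5/2017', '12/1/2016', '4/10/2017']

# Plain string sort compares character by character, so '4/10' sorts before '4/5'.
as_text = sorted(dates)
# Parsing first gives true chronological order.
as_dates = sorted(dates, key=lambda s: datetime.strptime(s, DATE_FMT))

print(as_text)   # ['12/1/2016', '4/10/2017', '4/5/2017']
print(as_dates)  # ['12/1/2016', '4/5/2017', '4/10/2017']
```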

Once your results are grouped by subject and sorted by due_date, just get the first element of each group. Done!
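Putting those steps together might look roughly like the sketch below. It assumes each report holds exactly one id/fields pair and that dates are month/day/year. The helpers get_fields and get_due_date are names made up for this example, and min() stands in for "sort by due_date and take the first element".

```python
from datetime import datetime
from itertools import groupby

DATE_FMT = '%m/%d/%Y'  # assumed date format

reports = [
    {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
    {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
    {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}},
]

def get_fields(report):
    # Assumes exactly one id -> fields pair per report.
    return next(iter(report.values()))

def get_subject(report):
    return get_fields(report)['subject']

def get_due_date(report):
    return datetime.strptime(get_fields(report)['due_date'], DATE_FMT)

# groupby only merges adjacent items, so sort by subject first.
reports.sort(key=get_subject)

# Keep the report with the earliest due date in each subject group.
deduped = [min(group, key=get_due_date)
           for _, group in groupby(reports, key=get_subject)]

print(deduped)
# [{'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}},
#  {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}}]
```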

Also it would be a good idea to use datetime for date sorting, instead of reinventing the wheel. [You can write datetimes to strings and parse them again to datetime objects if necessary](http://stackoverflow.com/questions/466345/converting-string-into-datetime). — Pablo Arias, Apr 08 '17 at 20:23

martineau · Accepted Answer · 2017-04-11T14:44:43.723

The problem is made more difficult because you don't know the key values (unique ids) of the dictionaries in reports. Since each one consists of only one item, you can use next(iter(dict.values())) with Python 3 to get the single nested dictionary associated with it—which I called checkout in the code below to give it a name.
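As a quick illustration of that idiom on a single report (a sketch only; report_id and same_checkout are names introduced here for the example):

```python
report = {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}}

# Take the first (and, by assumption, only) value without knowing its key.
checkout = next(iter(report.values()))

# The same trick on items() also yields the unknown id.
report_id, same_checkout = next(iter(report.items()))

print(checkout['subject'])  # dupe1
print(report_id)            # 00T2A00003mDvq9
```

Unlike list(report.values())[0], this avoids building a temporary list, although for one-item dictionaries the difference is negligible.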

Given that, the approach I would use is to first create a dictionary that groups the elements in reports by subject, which gives you something like this to work with (note: I changed the sample reports data so that one subject has more than one duplicate):

{
    'dupe1': [
        {'00T2A00003mDvq9': {'due_date': '4/5/2017', 'subject': 'dupe1'}},
        {'00T2A00003mDvq7': {'due_date': '4/3/2017', 'subject': 'dupe1'}},
        {'00T2A00003mDvq6': {'due_date': '4/6/2017', 'subject': 'dupe1'}}
    ],
    'dupe2': [
        {'00T2A00003mDvq8': {'due_date': '4/7/2017', 'subject': 'dupe2'}}
    ]
}

The lists of reports associated with each subject can then be sorted by date (using a lambda based on the same next(iter(dict.values())) trick). With each list ordered from oldest to newest, it's easy to rebuild reports so that only the oldest report for each subject remains.

from time import strptime
from pprint import pprint

DATE_FMT = '%m/%d/%Y'
reports = [
    {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
    {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
    {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}},
    {'00T2A00003mDvq6': {'subject': 'dupe1', 'due_date': '4/6/2017'}},  # + a third duplicate
]

by_subject = {}
for report in reports:
    checkout = next(iter(report.values()))  # get single subdictionary in each dictionary
    by_subject.setdefault(checkout['subject'], []).append(report)

for records in by_subject.values():
    records.sort(key=lambda rpt: strptime(next(iter(rpt.values()))['due_date'], DATE_FMT))

# Update reports list in-place.
del reports[:]
for subject, records in by_subject.items():
    reports.append(records[0])  # only keep oldest (deletes all newer than first)

print('Deduped reports:')
pprint(reports)

Output:

Deduped reports:
[{'00T2A00003mDvq7': {'due_date': '4/3/2017', 'subject': 'dupe1'}},
 {'00T2A00003mDvq8': {'due_date': '4/7/2017', 'subject': 'dupe2'}}]

This almost works and is going in the right direction. One problem is that I can have many duplicates, and this one only deletes 1. Also, I don't need to keep the de-duped reports. Looking to keep the original list minus the duplicates. — bbennett36, Apr 08 '17 at 22:37
Your question says "I want to delete/remove **the** entire dict with the latest date" (emphasis mine) which seems to indicate deleting only one. Regardless, I've updated my answer accordingly (I think) based on your feedback. — martineau, Apr 09 '17 at 03:40
I posted my final solution. I think yours didn't fully work because I'm using Python 3, but it's almost exactly your answer. Thank you! — bbennett36, Apr 10 '17 at 21:03
Yep, missed the Python 3 tag, sorry. Trivial to fix, though (see updated answer)...and you're welcome. — martineau, Apr 10 '17 at 21:11

score 0 · Answer 3 · answered Apr 10 '17 at 21:01

This is the final solution that I went with. It is based on @martineau's answer with small changes, which I'm guessing were only needed because I'm using Python 3.

from time import strptime

DATE_FMT = '%m/%d/%Y'
reports = [
    {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
    {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
    {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}},
    {'00T2A00003mDvq6': {'subject': 'dupe1', 'due_date': '4/6/2017'}},  # + third duplicate
]

by_subject = {}
for report in reports:
    # assuming only one element in each dictionary
    topic = list(report.values())[0]
    by_subject.setdefault(topic['subject'], []).append(report)

# Sort each subject's reports from oldest to newest due date.
for records in by_subject.values():
    records.sort(key=lambda rec: strptime(list(rec.values())[0]['due_date'], DATE_FMT))

reports = []
for subject, records in by_subject.items():
    # Drop everything after the oldest report for this subject.
    del records[1:]
    reports.extend(records)
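Note that rebuilding the list from by_subject returns the survivors grouped by subject, not in their original positions. If "the original list minus the duplicates" should also keep the original order, one possible sketch is below. It assumes one id/fields pair per report and month/day/year dates, and dedupe_keep_order is a hypothetical helper name, not something from the answers above.

```python
from datetime import datetime

DATE_FMT = '%m/%d/%Y'  # assumed date format

def dedupe_keep_order(reports):
    """Return a new list keeping, per subject, only the report with the
    earliest due_date, in the order the reports originally appeared."""
    oldest = {}  # subject -> (parsed due date, report id)
    for report in reports:
        report_id, fields = next(iter(report.items()))
        due = datetime.strptime(fields['due_date'], DATE_FMT)
        best = oldest.get(fields['subject'])
        # Strict "<" means the first report wins when dates tie.
        if best is None or due < best[0]:
            oldest[fields['subject']] = (due, report_id)
    keep_ids = {report_id for _, report_id in oldest.values()}
    return [r for r in reports if next(iter(r)) in keep_ids]

reports = [
    {'00T2A00003mDvq9': {'subject': 'dupe1', 'due_date': '4/5/2017'}},
    {'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
    {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}},
    {'00T2A00003mDvq6': {'subject': 'dupe1', 'due_date': '4/6/2017'}},
]

deduped = dedupe_keep_order(reports)
print(deduped)
# [{'00T2A00003mDvq8': {'subject': 'dupe2', 'due_date': '4/7/2017'}},
#  {'00T2A00003mDvq7': {'subject': 'dupe1', 'due_date': '4/3/2017'}}]
```

This makes a single pass to find the oldest report per subject and a second pass to filter, so no per-subject sorting is needed.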
