Trying to remove duplicates based on a condition from a list of dictionaries

Question

I have this list of dictionaries:

list_dict = [
    {'title':'abc defg hij', 'situation':'other'},
    {'title':'c defg', 'situation':'other'},
    {'title':'defg hij', 'situation':'other'},
    {'title':'defg hij', 'situation':'deleted'}]

I'm trying to remove every dictionary that has some recurring elements in the title AND the same situation, keeping only the one with the longest string in the title key.

The desired output would be as follows:

[{'title':'abc defg hij', 'situation':'other'},
 {'title':'defg hij', 'situation':'deleted'}]

It's not really clear what "remove every dictionnary that has some reccuring elements in the title" means exactly. — Unmitigated, Apr 10 '23 at 19:26
OK, what have you tried? What do you need help with exactly? Check out [ask]. — wjandrea, Apr 10 '23 at 19:27
[`itertools.groupby`](https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby) might come in handy. — Chris, Apr 10 '23 at 19:45

score 1 · Answer 1 · answered Apr 10 '23 at 20:15

I'm assuming that by "has some recurring elements in the title", you mean "is a substring of any other title" (within a given situation).

I'm assuming also that you're dealing with relatively small datasets so you won't be concerned with a quadratic algorithm for eliminating redundant strings. Nothing fancy – just construct a set of compatible strings adding one string at a time, checking for substrings:

def find_distinct_strs(all_strs):
    distinct_strs = set()

    for new_str in all_strs:
        # iterate over a snapshot so the set can be modified inside the loop
        for existing_str in list(distinct_strs):
            if new_str in existing_str:
                # new_str is redundant, go to next
                break
            elif existing_str in new_str:
                # new_str supersedes existing_str
                distinct_strs.remove(existing_str)
        else:
            # no existing string contains new_str, so keep it
            distinct_strs.add(new_str)

    return list(distinct_strs)
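
As a cross-check on the substring idea, here is a minimal alternative sketch that sorts the titles longest first, so each candidate only has to be compared against strings already kept. The name `distinct_longest_first` is made up for illustration and is not part of the question or of any library; it is still quadratic in the worst case.

```python
def distinct_longest_first(all_strs):
    """Keep only strings that are not substrings of an already kept string."""
    kept = []
    # Longest first: a string can only be a substring of one at least as long
    for candidate in sorted(set(all_strs), key=len, reverse=True):
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept

titles = ['abc defg hij', 'c defg', 'defg hij']
print(distinct_longest_first(titles))  # ['abc defg hij']
```

Unlike the set-based version, this returns the surviving titles in a predictable longest-first order (ties between equal-length strings aside).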

You can then group all the entries by situation, find the distinct titles, and construct a suitably thinned list:

from itertools import groupby

def filter_list_dict(list_dict):
    # groupby only merges consecutive entries, so entries that share a
    # situation must already be adjacent in list_dict
    return [
        dict(title=title, situation=situation)
        for situation, entries in groupby(list_dict, lambda entry: entry["situation"])
        for title in find_distinct_strs(entry["title"] for entry in entries)
    ]
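
One caveat worth illustrating: `itertools.groupby` only groups consecutive items with equal keys. The sketch below uses a hypothetical reordering of the sample data (not from the question) to show what happens when same-situation entries are not adjacent, and how sorting by the key first avoids it.

```python
from itertools import groupby

# Hypothetical sample in which the 'other' entries are not adjacent
entries = [
    {'title': 'abc defg hij', 'situation': 'other'},
    {'title': 'defg hij', 'situation': 'deleted'},
    {'title': 'c defg', 'situation': 'other'},
]

def situation_of(entry):
    return entry['situation']

# Without sorting, 'other' shows up as two separate groups
unsorted_keys = [key for key, _ in groupby(entries, situation_of)]

# sorted() is stable, so entries keep their relative order within a situation
sorted_keys = [key for key, _ in groupby(sorted(entries, key=situation_of), situation_of)]

print(unsorted_keys)  # ['other', 'deleted', 'other']
print(sorted_keys)    # ['deleted', 'other']
```

Sorting changes the order in which situations appear in the result, so if the original order matters, grouping into a dict keyed by situation may be a better fit.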

Test the output:

> list_dict = [
    {'title':'abc defg hij', 'situation':'other'},
    {'title':'c defg', 'situation':'other'},
    {'title':'defg hij', 'situation':'other'},
    {'title':'defg hij', 'situation':'deleted'}
]
> print(filter_list_dict(list_dict))
[{'title': 'abc defg hij', 'situation': 'other'},
   {'title': 'defg hij', 'situation': 'deleted'}]
