I'm assuming that by "has some recurring elements in the title", you mean "is a substring of any other title" (within a given situation).
I'm assuming also that you're dealing with relatively small datasets so you won't be concerned with a quadratic algorithm for eliminating redundant strings. Nothing fancy – just construct a set of compatible strings adding one string at a time, checking for substrings:
    def find_distinct_strs(all_strs):
        distinct_strs = set()
        for new_str in all_strs:
            # Iterate over a snapshot so the set can be modified in the loop
            for existing_str in list(distinct_strs):
                if new_str in existing_str:
                    # new_str is redundant, go to next
                    break
                elif existing_str in new_str:
                    # new_str supersedes existing_str
                    distinct_strs.remove(existing_str)
            else:
                # no existing string contains new_str, so keep it
                distinct_strs.add(new_str)
        return list(distinct_strs)
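As a rough sketch of the same idea, assuming "redundant" still means "is a substring of another string": if the strings are processed longest first, a new string can never supersede one that was already kept, so nothing ever has to be removed. The name `find_distinct_strs_sorted` is made up for this illustration.

```python
def find_distinct_strs_sorted(all_strs):
    """Return the strings that are not substrings of any other given string."""
    kept = []
    # Longest first: a later string is never longer than a kept one, so it
    # can only be contained in a kept string, never contain one.
    # set() drops exact duplicates up front.
    for candidate in sorted(set(all_strs), key=len, reverse=True):
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept


titles = ["abc defg hij", "c defg", "defg hij"]
print(find_distinct_strs_sorted(titles))  # ['abc defg hij']
```

This is still quadratic in the worst case; it only trades the removal step for an up-front sort.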
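You can then group all the entries by situation, find the distinct titles, and construct a suitably thinned list. Note that groupby only groups consecutive entries, so this assumes the list is already ordered by situation: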
    from itertools import groupby  # groupby lives in itertools, not collections

    def filter_list_dict(list_dict):
        return [
            dict(title=title, situation=situation)
            for situation, entries in groupby(list_dict, lambda entry: entry["situation"])
            for title in find_distinct_strs(entry["title"] for entry in entries)
        ]
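If the entries are not guaranteed to arrive grouped by situation, one option is to collect the titles in a dictionary first, since `itertools.groupby` only merges consecutive entries with equal keys. The following is a minimal sketch under that assumption; `filter_list_dict_any_order` is a hypothetical name, and the helper is repeated here (in a form that avoids modifying the set while iterating over it) only so that the block runs on its own.

```python
from collections import defaultdict


def find_distinct_strs(all_strs):
    """Keep only strings that are not substrings of another string."""
    distinct_strs = set()
    for new_str in all_strs:
        if any(new_str in existing for existing in distinct_strs):
            continue  # new_str is redundant
        # Drop anything that new_str supersedes, then keep new_str.
        distinct_strs = {s for s in distinct_strs if s not in new_str}
        distinct_strs.add(new_str)
    return list(distinct_strs)


def filter_list_dict_any_order(list_dict):
    # Collect titles per situation; dicts preserve first-seen order of keys.
    titles_by_situation = defaultdict(list)
    for entry in list_dict:
        titles_by_situation[entry["situation"]].append(entry["title"])
    return [
        {"title": title, "situation": situation}
        for situation, titles in titles_by_situation.items()
        for title in find_distinct_strs(titles)
    ]


# Situations are interleaved here on purpose.
entries = [
    {"title": "abc defg hij", "situation": "other"},
    {"title": "defg hij", "situation": "deleted"},
    {"title": "c defg", "situation": "other"},
]
print(filter_list_dict_any_order(entries))
```

Situations appear in the order they are first seen; the order of titles within a situation is not guaranteed, because the helper returns them from a set.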
Test the output:
    > list_dict = [
          {'title': 'abc defg hij', 'situation': 'other'},
          {'title': 'c defg', 'situation': 'other'},
          {'title': 'defg hij', 'situation': 'other'},
          {'title': 'defg hij', 'situation': 'deleted'}
      ]
    > print(filter_list_dict(list_dict))
    [{'title': 'abc defg hij', 'situation': 'other'},
     {'title': 'defg hij', 'situation': 'deleted'}]