Check if element of list is sub-element of other list elements in same list

Question

I am looking for a way to check if an element of a list is sub-element of any other elements of that same list?

For example, let's use the below list as an example.

['Lebron James', 'Lebron', 'James']

The 2nd and 3rd elements of this list are a sub-element of the 1st element of the list.

I am looking for a way to remove these elements from the list so only the 1st element remains. I have been spinning my wheels and unable to come up with a solution.

Can someone help?

Thanks

Does it have to match a whole word in the longer string, or any substring? E.g. should the 2nd element of `['Lebron James', 'Le']` be returned? — Barmar, May 21 '19 at 18:53
I hadn't thought about this, but it probably should be whole match. Good point- do you have any thoughts on how to modify the solution below? — Zak Ray Sick, May 21 '19 at 18:56
also: [Remove items from list that are substrings of other items in the same list \[duplicate\]](https://stackoverflow.com/q/49872752/674039) — wim, May 21 '19 at 18:58
Use `split()` to split up the long string, and test whether the word is in that list. — Barmar, May 21 '19 at 18:59
@ZakRaySick You should add that requirement to the question. — Barmar, May 21 '19 at 19:01
@wim I reopened since the comment says he wants to match whole words, not substrings. Do you have a more appropriate dup before I post an answer? — Barmar, May 21 '19 at 19:02
I thought about my question a little more and am going to reframe it. — Zak Ray Sick, May 21 '19 at 19:03
@Barmar Meh, whether the "membership" is via substring or tuple member or overlapping range that does not significantly change the problem (nor the solutions) IMO. — wim, May 21 '19 at 19:13
@wim I think it changes it significantly, especially if you want to avoid n**2 solutions. — Barmar, May 21 '19 at 19:14
O.P. Have a look at intervaltree datastructure e.g. [Python representation for a set of non-overlapping integer ranges](https://stackoverflow.com/q/50592912/674039) — wim, May 21 '19 at 19:14
And I'm usually pretty liberal about what constitutes a duplicate when I close them. — Barmar, May 21 '19 at 19:15
What about `['Lebron James 1 2 3', 'Lebron James 1 2']`? Keep both or filter out the second? Or not relevant? — Barmar, May 21 '19 at 19:25
I've reverted the last edit since it seems to make it a completely different question. You should have posted a new question instead of editing this. — Barmar, May 21 '19 at 19:44

Alex Hall · Answer 1 · 2019-05-21T19:56:22.770

3

Here's a slow solution, might be acceptable depending on your data size:

lst = ['Lebron James', 'Lebron', 'James']
[s for s in lst if not any(s in s2.split() for s2 in lst if s != s2)]

edited May 21 '19 at 19:56

answered May 21 '19 at 18:50

Alex Hall

34,833
5
57
89

looks unnecessarily O(n**2) – wim May 21 '19 at 18:53
1

@wim That's why he said it's a slow solution. It will get slower when he addresses the comment that the OP just wants to match whole words, not any substring. – Barmar May 21 '19 at 18:58

score 1 · Answer 2 · edited Oct 22 '21 at 09:00

This is definitely an easier problem to tackle with the starting and ending points for the match instead of the strings themselves.

One approach can be to take all ranges from biggest to smallest, and work backwards, creating the result as you go, given a range is not fully contained in another.

lst = [(0, 10),(0, 4),(5, 10)]

result = []

def membership(big_range, small_range):
    '''return true if big_range fully contains the small_range.
    where both are tuples with a start and end value.
    '''
    if small_range[0] >= big_range[0] and small_range[1] <= big_range[1]:
        return True
    return False

for range_ in sorted(lst, key= lambda x: x[1] - x[0], reverse=True):
    if not any(membership(x, range_) for x in result):
        result.append(range_)

print(result)
#[(0, 10)]

Edit: this answer was in response to the OP'S edited question, which seems to have since been rolled back. Oh well. Hope it helps someone anyways.

Lante Dellarovere · Answer 3 · 2019-05-21T21:27:01.857

Can try to create a dictionary of all permutations (the choice between permutations, or sublists, or whatever, depends on the desired behavior) grouped by element's word count:

import re
import itertools
from collections import defaultdict

lst = [
    'Lebron Raymone James', 'Lebron Raymone', 
    'James', "Le", "Lebron James", 
    'Lebron James 1 2 3', 'Lebron James 1 2'
]

d = defaultdict(dict)
g = "\\b\w+\\b"

for x in lst:
    words = re.findall(g, x)  # could simply use x.split() if have just spaces
    combos = [
        x for i in range(1, len(words) + 1)
        for x in list(itertools.permutations(words, i))
    ]
    for c in combos:
        d[len(words)][tuple(c)] = True

and take just elements whose words are not present in any of the groups with greater words count:

M = max(d) 
res = []
for x in lst:
    words = tuple(re.findall(g, x))
    if not any(d[i].get(words) for i in range(len(words)+1, M+1)):
        res.append(x)
set(res)
# {'Le', 'Lebron James 1 2 3', 'Lebron Raymone James'}

score -2 · Answer 4 · answered May 21 '19 at 19:07

-2

Create a set containing all the words in the strings that are multiple words. Then go through the list, testing the strings to see if they're in the set.

wordset = set()
lst = ['Lebron James', 'Lebron', 'James']
for s in lst:
    if " " in s:
        wordset.update(s.split())
result = [x for x in lst if x not in wordset]

answered May 21 '19 at 19:07

Barmar

741,623
53
500
612

2

Your algorithm is bad. Consider input `['Lebron James 1 2 3', 'Lebron James 1 2']` – wim May 21 '19 at 19:19
I'm not sure if that's something he wants to filter out, we should ask. – Barmar May 21 '19 at 19:24

Check if element of list is sub-element of other list elements in same list

4 Answers4