1

I am looking for a way to check if an element of a list is sub-element of any other elements of that same list?

For example, let's use the below list as an example.

['Lebron James', 'Lebron', 'James']

The 2nd and 3rd elements of this list are a sub-element of the 1st element of the list.

I am looking for a way to remove these elements from the list so only the 1st element remains. I have been spinning my wheels and unable to come up with a solution.

Can someone help?

Thanks

Barmar
  • 741,623
  • 53
  • 500
  • 612
Zak Ray Sick
  • 95
  • 1
  • 8
  • @yatu no, the number of elements in the list can vary. – Zak Ray Sick May 21 '19 at 18:51
  • No, but the solution from @Alex below works well – Zak Ray Sick May 21 '19 at 18:53
  • 4
    Does it have to match a whole word in the longer string, or any substring? E.g. should the 2nd element of `['Lebron James', 'Le']` be returned? – Barmar May 21 '19 at 18:53
  • I hadn't thought about this, but it probably should be whole match. Good point- do you have any thoughts on how to modify the solution below? – Zak Ray Sick May 21 '19 at 18:56
  • also: [Remove items from list that are substrings of other items in the same list \[duplicate\]](https://stackoverflow.com/q/49872752/674039) – wim May 21 '19 at 18:58
  • Use `split()` to split up the long string, and test whether the word is in that list. – Barmar May 21 '19 at 18:59
  • @ZakRaySick You should add that requirement to the question. – Barmar May 21 '19 at 19:01
  • @wim I reopened since the comment says he wants to match whole words, not substrings. Do you have a more appropriate dup before I post an answer? – Barmar May 21 '19 at 19:02
  • I thought about my question a little more and am going to reframe it. – Zak Ray Sick May 21 '19 at 19:03
  • @Barmar Meh, whether the "membership" is via substring or tuple member or overlapping range that does not significantly change the problem (nor the solutions) IMO. – wim May 21 '19 at 19:13
  • @wim I think it changes it significantly, especially if you want to avoid n**2 solutions. – Barmar May 21 '19 at 19:14
  • O.P. Have a look at intervaltree datastructure e.g. [Python representation for a set of non-overlapping integer ranges](https://stackoverflow.com/q/50592912/674039) – wim May 21 '19 at 19:14
  • And I'm usually pretty liberal about what constitutes a duplicate when I close them. – Barmar May 21 '19 at 19:15
  • What about `['Lebron James 1 2 3', 'Lebron James 1 2']`? Keep both or filter out the second? Or not relevant? – Barmar May 21 '19 at 19:25
  • I've reverted the last edit since it seems to make it a completely different question. You should have posted a new question instead of editing this. – Barmar May 21 '19 at 19:44

4 Answers4

3

Here's a slow solution, might be acceptable depending on your data size:

lst = ['Lebron James', 'Lebron', 'James']
[s for s in lst if not any(s in s2.split() for s2 in lst if s != s2)]
Alex Hall
  • 34,833
  • 5
  • 57
  • 89
  • looks unnecessarily O(n**2) – wim May 21 '19 at 18:53
  • 1
    @wim That's why he said it's a slow solution. It will get slower when he addresses the comment that the OP just wants to match whole words, not any substring. – Barmar May 21 '19 at 18:58
1

This is definitely an easier problem to tackle with the starting and ending points for the match instead of the strings themselves.

One approach can be to take all ranges from biggest to smallest, and work backwards, creating the result as you go, given a range is not fully contained in another.

lst = [(0, 10),(0, 4),(5, 10)]

result = []

def membership(big_range, small_range):
    '''return true if big_range fully contains the small_range.
    where both are tuples with a start and end value.
    '''
    if small_range[0] >= big_range[0] and small_range[1] <= big_range[1]:
        return True
    return False

for range_ in sorted(lst, key= lambda x: x[1] - x[0], reverse=True):
    if not any(membership(x, range_) for x in result):
        result.append(range_)

print(result)
#[(0, 10)]

Edit: this answer was in response to the OP'S edited question, which seems to have since been rolled back. Oh well. Hope it helps someone anyways.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Paritosh Singh
  • 6,034
  • 2
  • 14
  • 33
0

Can try to create a dictionary of all permutations (the choice between permutations, or sublists, or whatever, depends on the desired behavior) grouped by element's word count:

import re
import itertools
from collections import defaultdict

lst = [
    'Lebron Raymone James', 'Lebron Raymone', 
    'James', "Le", "Lebron James", 
    'Lebron James 1 2 3', 'Lebron James 1 2'
]

d = defaultdict(dict)
g = "\\b\w+\\b"

for x in lst:
    words = re.findall(g, x)  # could simply use x.split() if have just spaces
    combos = [
        x for i in range(1, len(words) + 1)
        for x in list(itertools.permutations(words, i))
    ]
    for c in combos:
        d[len(words)][tuple(c)] = True

and take just elements whose words are not present in any of the groups with greater words count:

M = max(d) 
res = []
for x in lst:
    words = tuple(re.findall(g, x))
    if not any(d[i].get(words) for i in range(len(words)+1, M+1)):
        res.append(x)
set(res)
# {'Le', 'Lebron James 1 2 3', 'Lebron Raymone James'}
Lante Dellarovere
  • 1,838
  • 2
  • 7
  • 10
-2

Create a set containing all the words in the strings that are multiple words. Then go through the list, testing the strings to see if they're in the set.

wordset = set()
lst = ['Lebron James', 'Lebron', 'James']
for s in lst:
    if " " in s:
        wordset.update(s.split())
result = [x for x in lst if x not in wordset]

Barmar
  • 741,623
  • 53
  • 500
  • 612