Remove strings from a list if they are not in another list of single-item lists of strings

Question

I have two lists of strings as follows:

good_tags = ['c#', '.net', 'java']

all_tags = [['c# .net datetime'],
            ['c# datetime time datediff relative-time-span'], 
            ['html browser timezone user-agent timezone-offset']]

My goal is to keep only the tags from 'good_tags' in each string of 'all_tags'. For instance,

the first row of 'all_tags': ['c# .net datetime']
should become (based on the strings I want to keep from 'good_tags'): ['c# .net']

I have tried using 'in' instead of 'not in', based on the question "Remove all the elements that occur in one list from another":

y3 = [x for x in all_tags if x in good_tags]
print ('y3: ', y3) 
y4 = [x for x in good_tags if x in all_tags]
print ('y4: ', y4)

OUT:

y3:  []
y4:  []

`['c# .net datetime'] in ['c#', '.net', 'java']` evaluates to False. If you actually have e.g. `['c# .net datetime']` and want `['c# .net']`, you need to split the string, check each word individually, and join the result. - jonrsharpe, Sep 24 '20 at 09:28
Why does `all_tags` consist of lists with only one element? There might be a better way to build this list in the first place. - Wups, Sep 24 '20 at 09:44
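
To make the first comment concrete, here is a minimal sketch of the split, check, and join steps for a single row. The names `row`, `words`, `kept`, and `result` are only illustrative and do not come from the question.

```python
good_tags = ['c#', '.net', 'java']
row = ['c# .net datetime']

# A one-item list is compared as a whole against each string, so this is False.
print(row in good_tags)  # False

# Split the single string into words, keep the good ones, and join them back.
words = row[0].split()                       # ['c#', '.net', 'datetime']
kept = [w for w in words if w in good_tags]  # ['c#', '.net']
result = [' '.join(kept)]
print(result)  # ['c# .net']
```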

score 2 · Answer 1 · answered Sep 24 '20 at 09:45

good_tags = ['c#', '.net', 'java']

all_tags = [
    ['c# .net datetime'],
    ['c# datetime time datediff relative-time-span'],
    ['html browser timezone user-agent timezone-offset']
]

filtered_tags = [[" ".join(filter(lambda tag: tag in good_tags, row[0].split()))] for row in all_tags]
print(filtered_tags)

Output:

[['c# .net'], ['c#'], ['']]

score 1 · Answer 2 · answered Sep 24 '20 at 09:30

First of all, you don't have two lists of strings. You have a list of strings and a list of lists of strings.

good_tags = ['c#', '.net', 'java']

all_tags = [['c# .net datetime'],['c# datetime time datediff relative-time-span'], ['html browser timezone user-agent timezone-offset']]

all_tags_with_good_tags = []

for tags in all_tags:
    new_good_tags = set()
    # each row is a list with a single string, so take element 0
    # and split it on whitespace to get the individual tags
    for tag in tags[0].split():
        if tag in good_tags:
            new_good_tags.add(tag)
    if new_good_tags:
        all_tags_with_good_tags.append(' '.join(new_good_tags))

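Printing all_tags_with_good_tags will get you something like the following (a set has no guaranteed order, so the first entry may also come out as 'c# .net'):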

['.net c#', 'c#']
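
If the order of the tags within each row matters, one possible variant collects the matches in a list instead of a set. This is only a sketch under that assumption, and the name `ordered_good_tags` is made up for the example.

```python
good_tags = ['c#', '.net', 'java']
all_tags = [['c# .net datetime'],
            ['c# datetime time datediff relative-time-span'],
            ['html browser timezone user-agent timezone-offset']]

ordered_good_tags = []
for tags in all_tags:
    # A list keeps the tags in the order they appear in the input string.
    kept = [tag for tag in tags[0].split() if tag in good_tags]
    if kept:  # rows without any good tag are skipped
        ordered_good_tags.append(' '.join(kept))

print(ordered_good_tags)  # ['c# .net', 'c#']
```

Unlike the set version, this keeps a tag twice if it appears twice in the same row.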

score 1 · Answer 3 · answered Sep 24 '20 at 09:40

There might be a better way to do this, but here it is:

good_tags = ['c#', '.net', 'java']

all_tags = [['c# .net datetime'],['c# datetime time datediff relative-time-span'], ['html browser timezone user-agent timezone-offset']]

for tags in all_tags:
    empty = []
    for tag in tags[0].split(" "):
        if tag in good_tags:
            empty.append(tag)
    print(" ".join(empty))
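
The loop above prints one line per row, and the last line is empty because the third row has no good tag. If the result is needed as a list rather than printed, a sketch along the same lines could look like this; `kept_rows` is a name made up here. It calls `split()` without an argument, which also copes with repeated spaces, whereas `split(" ")` produces empty strings for them.

```python
good_tags = ['c#', '.net', 'java']
all_tags = [['c# .net datetime'],
            ['c# datetime time datediff relative-time-span'],
            ['html browser timezone user-agent timezone-offset']]

kept_rows = []
for tags in all_tags:
    kept = [tag for tag in tags[0].split() if tag in good_tags]
    kept_rows.append(" ".join(kept))

print(kept_rows)  # ['c# .net', 'c#', '']

# split(" ") and split() differ once there is more than one space in a row.
print("c#  .net".split(" "))  # ['c#', '', '.net']
print("c#  .net".split())     # ['c#', '.net']
```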

score 1 · Answer 4 · answered Sep 24 '20 at 09:44

Your all_tags is a list containing three lists, each of which contains a single string. So the first thing to do is convert each sublist into a list of separate tag strings instead of one long string.

Since the tags are separated only by spaces, not commas, you have to transform the list from ['c# .net datetime'] to ['c#', '.net', 'datetime'] with:

[x for segments in all_tags[0] for x in segments.split()]

And then you can do this for your entire list by iterating over its indices:

[[x for segments in all_tags[entry] for x in segments.split()] for entry in range(len(all_tags))]

which returns:

[['c#', '.net', 'datetime'],
 ['c#', 'datetime', 'time', 'datediff', 'relative-time-span'],
 ['html', 'browser', 'timezone', 'user-agent', 'timezone-offset']]

And now you can filter this list according to your good tags:

y3 = [[x for x in [words for segments in all_tags[entry] for words in segments.split()] if x in good_tags] for entry in range(len(all_tags))]

Output:

[['c#', '.net'], ['c#'], []]
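
The same two steps can also be written without indexing through `range(len(all_tags))`. This is a sketch that assumes, as in the question, that every sublist holds space-separated strings; `split_tags` and `filtered` are illustrative names.

```python
good_tags = ['c#', '.net', 'java']
all_tags = [['c# .net datetime'],
            ['c# datetime time datediff relative-time-span'],
            ['html browser timezone user-agent timezone-offset']]

# Step 1: turn each sublist into a flat list of individual tags.
split_tags = [[word for segment in sublist for word in segment.split()]
              for sublist in all_tags]

# Step 2: keep only the tags that appear in good_tags.
filtered = [[tag for tag in tags if tag in good_tags] for tags in split_tags]
print(filtered)  # [['c#', '.net'], ['c#'], []]
```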

prabh · Answer 5 · 2020-09-24T09:56:30.770

good_tags = ['c#', '.net', 'java']

all_tags = [['c# .net datetime'],['c# datetime time datediff relative-time-span'], ['html browser timezone user-agent timezone-offset']]

new_tags = []

for row in all_tags:
    tags = row[0].split()
    newtag = ''
    for tag in tags:
        if tag in good_tags:
            if newtag == '':
                newtag = tag
            else:
                newtag = newtag + ' ' + tag

    if newtag != '':
        # wrap the joined string in a single-item list, as in all_tags
        new_tags.append([newtag])
        
print(new_tags)

score 1 · Answer 6 · answered Sep 24 '20 at 09:56

good_set = set(good_tags)
kept_tags = [[t for t in tags[0].split() if t in good_set] 
    for tags in all_tags]
print(kept_tags)
# [['c#', '.net'], ['c#'], []]

answered Sep 24 '20 at 09:56

tzaman

score 1 · Accepted Answer · edited Sep 24 '20 at 10:08

1st Statement: When "for x in all_tags" runs, each x is a list such as ['c# .net datetime']. Its only element, 'c# .net datetime', is a single string, so the tags inside it are not treated separately.

2nd Statement: After the first statement, x = ['c# .net datetime'], which is a list. "x in good_tags" then searches good_tags for this whole list. good_tags does not contain it, so nothing is returned.

Condition 1: If good_tags were ['c#', '.net', 'java', ['c# .net datetime']], then the comprehension would keep ['c# .net datetime'].
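
The two statements and Condition 1 can be checked directly. This is only an illustrative sketch; `first` and `extended_tags` are made-up names, the latter standing for the modified list from Condition 1.

```python
good_tags = ['c#', '.net', 'java']
all_tags = [['c# .net datetime'],
            ['c# datetime time datediff relative-time-span'],
            ['html browser timezone user-agent timezone-offset']]

# Each x taken from all_tags is a list holding one string.
first = all_tags[0]
print(type(first).__name__, first)  # list ['c# .net datetime']

# good_tags holds only strings, so no list is ever found in it.
y3 = [x for x in all_tags if x in good_tags]
print(y3)  # []

# Condition 1: once the whole list is itself an element, the test succeeds.
extended_tags = good_tags + [['c# .net datetime']]
y3_extended = [x for x in all_tags if x in extended_tags]
print(y3_extended)  # [['c# .net datetime']]
```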

Here is a solution for your problem:

good_tags = ['c#', '.net', 'java']

all_tags = [['c# .net datetime'], ['c# datetime time datediff relative-time-span'],
            ['html browser timezone user-agent timezone-offset']]


#y3 = [x for x in all_tags if x in good_tags]
all_tags_refine = []
for x in all_tags:
    y = x[0].split()

    z = [k for k in y if k in good_tags]
    all_tags_refine.append(z)

print(all_tags_refine)
