Check if Multiple Strings are present in a DataFrame Column

Question

I would like to check if items in a List are in a Column from my DF.

The basics where straightforward:

fruit = ['apple','banana']    # This items should be in the column 
fruit = ', '.join(fruit)      # Think this is the point where it goes wrong... 

fruit_resulst = df['all_fruit'].str.contains(fruit) # Check if column contains fruit 
df_new = df[fruit_resulst]   # Filter so that we only keep the TRUEs

This works, but not completely. It only works in this specific order, but I would like to have it working in all orders (e.g., if a column row contains ALL items from the list, then I would like to keep them. Else, remove.

df['all_fruit']

Apple, Banana             #Return! Because it contains apple and banana
Banana                    # Do not return 
Banana, Apple             #Return! Because it contains apple and banana    
Apple                     # Do not return
Apple, Banana, Peer       #Return! Because it contains apple and banana

Thanks a lot in advance!

jezrael · Accepted Answer · 2021-08-16T10:26:47.933

2

Convert values to lowercase, then split to lists and test issubset by convert fruit to set:

df1 = df[df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Your solution with list of boolean masks passed to np.logical_and.reduce:

df1 = df[np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

edited Aug 16 '21 at 10:26

answered Aug 16 '21 at 10:18

jezrael

822,522
95
1,334
1,252

1

This is perfect. Quick one (sorry to bother, hope you can help) - would it also be possible to add a column to the original DF, which shows FALSE, except when the match is True? Then it shows True? – R overflow Aug 16 '21 at 11:21
1

@Roverflow - For first solution `df['test'] = ~df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)`, for second `df['test'] = ~np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])` – jezrael Aug 16 '21 at 11:22
1

@Roverflow - So it means False, True, False, true, False ? – jezrael Aug 16 '21 at 11:24
thanks a lot! The other way around... True, False, True, False, True :-) – R overflow Aug 16 '21 at 11:27
1

@Roverflow - Then remove `~` for invert mask – jezrael Aug 16 '21 at 11:41

score 1 · Answer 2 · answered Aug 16 '21 at 10:21

df = pd.DataFrame({'all_fruit': [
    'Apple, Banana',
    'Banana',
    'Banana, Apple',
    'Apple',
    'Apple, Banana, Peer',
]})
fruit = ['apple','banana']
have_fruits = [df.all_fruit.str.contains(f, case=False) for f in fruit]
indexes = True
for f in have_fruits:
    indexes = indexes * f
df[indexes]

score 1 · Answer 3 · answered Aug 16 '21 at 10:26

1

Try this code:

x = df['all_fruit'].str.split(',', expand=True)
print(df[x.replace('Apple', '').ne(x).any(1) & x.replace(' Banana', '').ne(x).any(1)])

Output:

             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

answered Aug 16 '21 at 10:26

U13-Forward

69,221
14
89
114

Check if Multiple Strings are present in a DataFrame Column

3 Answers3