2

I would like to check if items in a List are in a Column from my DF.

The basics where straightforward:

fruit = ['apple','banana']    # This items should be in the column 
fruit = ', '.join(fruit)      # Think this is the point where it goes wrong... 

fruit_resulst = df['all_fruit'].str.contains(fruit) # Check if column contains fruit 
df_new = df[fruit_resulst]   # Filter so that we only keep the TRUEs 

This works, but not completely. It only works in this specific order, but I would like to have it working in all orders (e.g., if a column row contains ALL items from the list, then I would like to keep them. Else, remove.

df['all_fruit']

Apple, Banana             #Return! Because it contains apple and banana
Banana                    # Do not return 
Banana, Apple             #Return! Because it contains apple and banana    
Apple                     # Do not return
Apple, Banana, Peer       #Return! Because it contains apple and banana

Thanks a lot in advance!

R overflow
  • 1,292
  • 2
  • 17
  • 37

3 Answers3

2

Convert values to lowercase, then split to lists and test issubset by convert fruit to set:

df1 = df[df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Your solution with list of boolean masks passed to np.logical_and.reduce:

df1 = df[np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    This is perfect. Quick one (sorry to bother, hope you can help) - would it also be possible to add a column to the original DF, which shows FALSE, except when the match is True? Then it shows True? – R overflow Aug 16 '21 at 11:21
  • 1
    @Roverflow - For first solution `df['test'] = ~df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)`, for second `df['test'] = ~np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])` – jezrael Aug 16 '21 at 11:22
  • 1
    @Roverflow - So it means False, True, False, true, False ? – jezrael Aug 16 '21 at 11:24
  • thanks a lot! The other way around... True, False, True, False, True :-) – R overflow Aug 16 '21 at 11:27
  • 1
    @Roverflow - Then remove `~` for invert mask – jezrael Aug 16 '21 at 11:41
1
df = pd.DataFrame({'all_fruit': [
    'Apple, Banana',
    'Banana',
    'Banana, Apple',
    'Apple',
    'Apple, Banana, Peer',
]})
fruit = ['apple','banana']
have_fruits = [df.all_fruit.str.contains(f, case=False) for f in fruit]
indexes = True
for f in have_fruits:
    indexes = indexes * f
df[indexes]
Behzad Shayegh
  • 323
  • 1
  • 10
1

Try this code:

x = df['all_fruit'].str.split(',', expand=True)
print(df[x.replace('Apple', '').ne(x).any(1) & x.replace(' Banana', '').ne(x).any(1)])

Output:

             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer
U13-Forward
  • 69,221
  • 14
  • 89
  • 114