53

I have a df (pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The expression df.col_name.str.contains("apple|banana") catches all three rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND condition to the str.contains() method, so that it only grabs strings that contain BOTH "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).

smci
aerin
  • 1
    The example is toy because you only have K=2 substrings and they occur in-order: apple, banana. But you're really asking for a method that **matches K=10-20 substrings, in any order**. Regex with multiple lookahead assertions is the way to go (@Anzel's solution). – smci Jul 08 '21 at 04:26

10 Answers

57

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
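
The question mentions matching 10-20 different words, so for completeness, here is a minimal sketch of how the same chaining idea generalizes to a word list (the `words` list and `np.logical_and.reduce` are additions of mine, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

words = ['apple', 'banana']  # extend with grape, watermelon, berry, orange, ...
# Build one boolean mask per word, then AND them all together.
mask = np.logical_and.reduce([df['col_name'].str.contains(w) for w in words])
print(df[mask])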
flyingmeatball
46

You can also do it in regex style, using lookahead assertions:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

Then you can do your stuff dynamically.
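
For reference, a minimal end-to-end sketch of the above (the sample DataFrame is reconstructed from the question and is not part of the original answer):

import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana']
pattern = base.format(''.join(expr.format(w) for w in words))  # '^(?=.*apple)(?=.*banana)'

# Only the row that satisfies every lookahead (i.e. contains every word) is kept.
print(df[df['col_name'].str.contains(pattern)])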

Anzel
  • This is great. I tried doing it with f-strings. It turned out like this; do you have any improvements? `filter_string = '^' + ''.join(fr'(?=.*{w})' for w in words)` – spen.smith Aug 11 '20 at 14:29
  • 1
    @spen.smith I think your implementation is clear and simple; don’t think you need to improve it further unless you run into issue – Anzel Aug 11 '20 at 15:29
  • 1
    Anzel solution is solid. However, '^(?=.*apple)(?=.*banana)' works fine but this might need to be modified if the order in which apple and banana may appear is not know before hand. Something like this expression will do when the order is not known '^(?=.*apple)(?=.*banana)|^(?=.*banana)(?=.*apple)'. Also, I would remove ^ to make it search for anywhere in the string not just in the begining. – seakyourpeak Mar 11 '22 at 17:19
37
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
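
To get the matching rows rather than a boolean Series, index the DataFrame with the mask (this last indexing step is an addition, not shown in the answer above):

>>> df[df.col.apply(lambda sentence: all(word in sentence for word in targets))]
                                   col
2  apple and banana both are delicious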
Alexander
13

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)', regex=True)
Charan Reddy
6

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['apple', 'banana', 'strawberry']
# One boolean mask per target word, combined so a row must match all of them.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]
Sergey Zakharov
4

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1, "apple is delicious"],
                   [2, "banana is delicious"],
                   [3, "apple and banana both are delicious"]],
                  columns=('ID', 'String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output

   ID                           String_Col
2   3  apple and banana both are delicious
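
Note that this alternation approach grows quickly as you add words: with k words there are k! orderings to enumerate. A small illustrative sketch (itertools.permutations and the third word are additions of mine, not from the answer):

from itertools import permutations

words = ['apple', 'banana', 'orange']
pattern = '|'.join('.*'.join(p) for p in permutations(words))
print(pattern)
# 6 alternatives: apple.*banana.*orange|apple.*orange.*banana|banana.*apple.*orange|...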
pmaniyan
3

If you want to catch at least two of the words in a sentence, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch and they are separated by a comma ',', then add the comma to the connector_list and modify the second condition from all to any (see the sketch after the output below):

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious
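
For reference, a setup that reproduces the output above might look like this (the fourth row and the comma in connector_list are reconstructed from the answer's description rather than shown explicitly in it):

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})

target = ['apple', 'banana', 'grapes', 'orange']
connector_list = ['and', ',']  # comma added, as suggested above

df[df.col.apply(lambda sentence: (any(word in sentence for word in target))
                & (any(connector in sentence for connector in connector_list)))]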
Siraj S.
3

Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we wanted to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # reduce is not a builtin in Python 3

print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
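
An equivalent spelling (a variation of mine, not from the original answer) replaces the lambda with operator.and_:

import operator
from functools import reduce

print(df[reduce(operator.and_, (df['col'].str.contains(s) for s in targets))])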
pault
3

You can create masks and then combine them:

apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]
1

From @Anzel's answer, I wrote a function since I'm going to be applying this a lot:

def regify(words, base=r'^{}', expr=r'(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

And then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

then you should get what you're after.
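
As a quick sanity check, calling the helper directly shows the pattern it builds (it matches the rendered regex from @Anzel's answer for the two-word case):

>>> regify(['apple', 'banana'])
'^(?=.*apple)(?=.*banana)'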

Jonny