53

I have a df (pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

The expression df.col_name.str.contains("apple|banana") catches all three rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND condition to the str.contains() method, so that it only grabs strings that contain BOTH "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, etc.).

smci
aerin
  • 1
    The example is toy because you only have K=2 substrings and they occur in-order: apple, banana. But you're really asking for a method that **matches K=10-20 substrings, in any order**. Regex with multiple lookahead assertions is the way to go (@Anzel's solution). – smci Jul 08 '21 at 04:26

10 Answers

57

You can do that as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
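
The question mentions matching 10-20 different words, so for completeness, here is a minimal sketch of how the same chaining idea generalizes to a word list (the `words` list and `np.logical_and.reduce` are additions of mine, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

words = ['apple', 'banana']  # extend with grape, watermelon, berry, orange, ...
# Build one boolean mask per word, then AND them all together.
mask = np.logical_and.reduce([df['col_name'].str.contains(w) for w in words])
print(df[mask])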
flyingmeatball
46

You can also do it in regex style, using lookahead assertions:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build your list of words into a regex string like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

will render:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

Then you can do your stuff dynamically.
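
For reference, a minimal end-to-end sketch of the above (the sample DataFrame is reconstructed from the question and is not part of the original answer):

import pandas as pd

df = pd.DataFrame({'col_name': ["apple is delicious",
                                "banana is delicious",
                                "apple and banana both are delicious"]})

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana']
pattern = base.format(''.join(expr.format(w) for w in words))  # '^(?=.*apple)(?=.*banana)'

# Only the row that satisfies every lookahead (i.e. contains every word) is kept.
print(df[df['col_name'].str.contains(pattern)])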

Anzel
  • This is great. I tried doing it with f-strings. It turned out like this; do you have any improvements? `filter_string = '^' + ''.join(fr'(?=.*{w})' for w in words)` – spen.smith Aug 11 '20 at 14:29
  • 1
    @spen.smith I think your implementation is clear and simple; don’t think you need to improve it further unless you run into issue – Anzel Aug 11 '20 at 15:29
  • 1
    Anzel solution is solid. However, '^(?=.*apple)(?=.*banana)' works fine but this might need to be modified if the order in which apple and banana may appear is not know before hand. Something like this expression will do when the order is not known '^(?=.*apple)(?=.*banana)|^(?=.*banana)(?=.*apple)'. Also, I would remove ^ to make it search for anywhere in the string not just in the begining. – seakyourpeak Mar 11 '22 at 17:19
37
df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# Any word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in the sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
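
To get the matching rows rather than a boolean Series, index the DataFrame with the mask (this last indexing step is an addition, not shown in the answer above):

>>> df[df.col.apply(lambda sentence: all(word in sentence for word in targets))]
                                   col
2  apple and banana both are delicious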
Alexander
13

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)', regex=True)
Charan Reddy
6

If you only want to use native methods and avoid writing regexps, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['apple', 'banana', 'strawberry']
# One boolean mask per target word, combined so a row must match all of them.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]
Sergey Zakharov
4

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1, "apple is delicious"],
                   [2, "banana is delicious"],
                   [3, "apple and banana both are delicious"]],
                  columns=('ID', 'String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output

   ID                           String_Col
2   3  apple and banana both are delicious
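
Note that this alternation approach grows quickly as you add words: with k words there are k! orderings to enumerate. A small illustrative sketch (itertools.permutations and the third word are additions of mine, not from the answer):

from itertools import permutations

words = ['apple', 'banana', 'orange']
pattern = '|'.join('.*'.join(p) for p in permutations(words))
print(pattern)
# 6 alternatives: apple.*banana.*orange|apple.*orange.*banana|banana.*apple.*orange|...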
pmaniyan
3

If you want to catch at least two of the words in a sentence, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

output:

                                   col
2  apple and banana both are delicious

If you have more than two words to catch and they are separated by a comma ',', then add the comma to the connector_list and modify the second condition from all to any (see the sketch after the output below):

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious
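
For reference, a setup that reproduces the output above might look like this (the fourth row and the comma in connector_list are reconstructed from the answer's description rather than shown explicitly in it):

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "orange,banana and apple all are delicious"]})

target = ['apple', 'banana', 'grapes', 'orange']
connector_list = ['and', ',']  # comma added, as suggested above

df[df.col.apply(lambda sentence: (any(word in sentence for word in target))
                & (any(connector in sentence for connector in connector_list)))]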
Siraj S.
3

Enumerating all possibilities for large lists is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious",
                           "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we wanted to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # reduce is not a builtin in Python 3

print(df[reduce(lambda a, b: a & b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
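
An equivalent spelling (a variation of mine, not from the original answer) replaces the lambda with operator.and_:

import operator
from functools import reduce

print(df[reduce(operator.and_, (df['col'].str.contains(s) for s in targets))])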
pault
3

You can create masks and then combine them:

apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]
1

From @Anzel's answer, I wrote a function since I'm going to be applying this a lot:

def regify(words, base=r'^{}', expr=r'(?=.*{})'):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

And then call it with something like:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

then you should get what you're after.
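
As a quick sanity check, calling the helper directly shows the pattern it builds (it matches the rendered regex from @Anzel's answer for the two-word case):

>>> regify(['apple', 'banana'])
'^(?=.*apple)(?=.*banana)'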

Jonny