How to test if a string contains one of the substrings in a list, in pandas?

Question

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at'], I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame[found]
result.any()

Is there a better way to do this?

*Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. — jpp, May 06 '18 at 22:09
Highly recommend checking out [this answer](https://stackoverflow.com/a/55335207) for partial string search using multiple keywords/regexes (scroll down to the "**Multiple Substring Search**" subheading). — cs95, Apr 07 '19 at 21:04
In the specific example in the question, you could use `pd.Series.str.endswith` with a tuple argument: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.endswith.html — user7868, Oct 20 '22 at 00:19

Alex Riley · Accepted Answer · 2017-02-28T20:42:50.397

427

One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings with in this new list will match each character literally when used with str.contains.

edited Feb 28 '17 at 20:42

answered Oct 26 '14 at 20:40

Alex Riley

169,130
45
262
238

4

maybe good to add this link http://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings too. Starting from pandas 0.15, the string operations are even easier – goofd Oct 26 '14 at 21:19
7

one thing you have to take care with is if a string in searchfor has special regex characters (you can [map with re.escape](http://stackoverflow.com/questions/280435/escaping-regex-string-in-python)). – Andy Hayden Oct 26 '14 at 21:24
I don't know why your method doesn't work with "str.startswith('|'.join(searchfor))" – Doo Hyun Shin Feb 17 '19 at 12:59
2

in this case I understand we use "|" for OR, how could we use AND?? – The Dan Feb 11 '21 at 23:31

score 105 · Answer 2 · answered Oct 26 '14 at 21:33

105

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')]

Output:

0 cat
1 hat
2 dog
3 fog

answered Oct 26 '14 at 21:33

l'L'l

44,951
10
95
146

3

how to do it for AND? – JacoSolari Mar 25 '20 at 17:07
3

@JacoSolari check out this answer https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation – James Mar 26 '20 at 10:08
3

@James yes, thanks. For completion here is the most upvoted oneliner in that answer. `df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)` – JacoSolari Mar 26 '20 at 14:13

score 13 · Answer 3 · answered Apr 01 '20 at 21:30

Here is a one line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

   col1  col2
0   cat 1000.0
1   hat 2000000.0
2   dog 1000.0
3   fog 330000.0
4   pet 330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

    col1    col2        TrueFalse
0   cat     1000.0      1
1   hat     2000000.0   1
2   dog     1000.0      1
3   fog     330000.0    1
4   pet     330000.0    0

I did it as `df.loc[df.col1.apply(lambda x: True if any(i in x for i in searchfor) else False)]` and it gone well, thanks. — emremrah, Dec 21 '20 at 05:53

score 0 · Answer 4 · edited Dec 16 '22 at 04:41

0

Had the same issue. Without making it too complex, you can add | in between each entry, like fieldname.str.contains("cat|dog") works

edited Dec 16 '22 at 04:41

Suraj Rao

29,388
11
94
103

answered Dec 16 '22 at 04:26

Mammatt

1
2

Hi there This solution was already provided (https://stackoverflow.com/a/26578218/12439119), try not to duplicate answers. – Alexander L. Hayes Dec 19 '22 at 17:27

How to test if a string contains one of the substrings in a list, in pandas?

4 Answers4

Linked

Related