Check if multiple substrings are in pandas dataframe

Question

I have a pandas dataframe and want to check whether the values in a certain column contain particular substrings. At the moment I have 30 lines of code of this kind:

(df['NAME'].str.upper().str.contains('LIMITED')) |
(df['NAME'].str.upper().str.contains('INC')) |
(df['NAME'].str.upper().str.contains('CORP'))

They are all linked with an or condition, and if any of them is true, the name belongs to a company rather than a person.

But to me this doesn't seem very elegant. Is there a way to check whether a pandas string column contains any of the substrings in a list such as ['LIMITED', 'INC', 'CORP']?

I found the pandas.DataFrame.isin function, but it only matches entire strings, not substrings.

*Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. — jpp, May 06 '18 at 22:12

Scott Boston · Accepted Answer · 2018-03-27T09:10:05.743

You can use a regular expression, where '|' means "or":

l = ['LIMITED','INC','CORP']  
regstr = '|'.join(l)
df['NAME'].str.upper().str.contains(regstr)

MVCE:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'NAME':['Baby CORP.','Baby','Baby INC.','Baby LIMITED']})

In [3]: df
Out[3]: 
           NAME
0    Baby CORP.
1          Baby
2     Baby INC.
3  Baby LIMITED

In [4]: l = ['LIMITED','INC','CORP']  
   ...: regstr = '|'.join(l)
   ...: df['NAME'].str.upper().str.contains(regstr)
   ...: 
Out[4]: 
0     True
1    False
2     True
3     True
Name: NAME, dtype: bool

In [5]: regstr
Out[5]: 'LIMITED|INC|CORP'
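
A possible variation on the same idea, offered as a sketch rather than as part of the original answer: `str.contains` accepts `case=False`, which makes the `.str.upper()` step unnecessary, and `na=False`, which treats missing names as non-matches. Wrapping each keyword in `re.escape` guards against keywords that happen to contain regex metacharacters. The names `keywords`, `pattern` and `is_company`, and the sample data, are made up for illustration.

```python
import re

import pandas as pd

# Illustrative data: mixed case, plus one missing value
df = pd.DataFrame({'NAME': ['Baby Corp.', 'Baby', 'Baby inc.', 'Baby Limited', None]})

keywords = ['LIMITED', 'INC', 'CORP']

# Escape each keyword so characters such as '.' or '+' are matched literally,
# then join the keywords with '|' (regex "or")
pattern = '|'.join(re.escape(k) for k in keywords)

# case=False: case-insensitive match; na=False: missing names count as non-matches
is_company = df['NAME'].str.contains(pattern, case=False, na=False)

print(is_company.tolist())
```

The resulting boolean Series can be used directly as a filter, e.g. `df[is_company]`.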

edited Mar 27 '18 at 09:10

answered Mar 27 '18 at 09:04

Scott Boston

Can you please suggest something for 'and' condition. I want to check if all the words in my list exist in each row of dataframe. – Syed Md Ismail Mar 12 '21 at 13:37

@SyedMdIsmail `'&'.join(l)` – Scott Boston Mar 12 '21 at 14:31

I had tried it before asking. It's able to identify the 'or' condition but not the 'and'. Thanks for the reply. – Syed Md Ismail Mar 12 '21 at 14:55
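
For the 'and' case raised in this comment thread, joining with `'&'` does not help because `&` has no special meaning in a regular expression; it is matched as a literal character. One way that should work, sketched here under the assumption that "all words" means each word appears somewhere in the string, is to build one boolean mask per word and combine the masks with `&`. The names `words`, `masks` and `has_all`, and the sample data, are illustrative.

```python
import operator
import re
from functools import reduce

import pandas as pd

# Illustrative data
df = pd.DataFrame({'NAME': ['Baby CORP. LIMITED', 'Baby CORP.', 'LIMITED Baby corp', 'Baby']})

words = ['LIMITED', 'CORP']

# One boolean Series per word: True where the name contains that word
masks = [df['NAME'].str.contains(re.escape(w), case=False, na=False) for w in words]

# Element-wise "and" across all masks: True only where every word is present
has_all = reduce(operator.and_, masks)

print(has_all.tolist())
```

This scans the column once per word, so it trades some speed for readability compared with a single regex.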
