
I have a dataframe as shown below:

>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
           a  b  c  d
0       app;  1  2  3
1  app; web;  4  5  6
2       web;  7  8  9
3             1  4  5

I have an input array that looks like this: `["app","web"]`. For each of these values I want to check against a specific column of the dataframe and return a decision as shown below:

>>> df.a.str.contains("app")
0     True
1     True
2    False
3    False
Name: a, dtype: bool

Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to determine the same something like:

 df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'

My end goal is not an exact match (`df.a.isin(["app", "web"])`), but rather a 'contains' check that returns True when those substrings are present anywhere in that cell of the dataframe.
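To make the distinction concrete, an exact match finds nothing on this data, because every non-empty cell carries a trailing semicolon (illustrative run on the example frame):

>>> df.a.isin(["app", "web"])
0    False
1    False
2    False
3    False
Name: a, dtype: bool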

Note: I can of course use apply method to create my own function for the same logic such as:

elementsToLookFor = ["app","web"]
df[header] = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))  # header is the column name, e.g. 'a'

But I am more interested in the most efficient approach, so I would prefer a native pandas function, or failing that, the next most optimized custom solution.

smci
Shivam Sahil
  • Similar to this question [Check if pandas column contains all elements from a list](https://stackoverflow.com/q/60932036/15497888) – Henry Ecker Jul 08 '21 at 03:33
  • I saw your previous comment `df.a.str.contains("|".join(["app","web"]))` which almost solves my case, except that I was looking for an `&` operation. I am not very good with regex and was wondering if there's a way to `&` it instead of `|` in your previous logic, something like: `df.a.str.contains("&".join(["app","web"]))` – Shivam Sahil Jul 08 '21 at 03:38
  • There isn't an "and" type operation in that context for regular expressions. – Henry Ecker Jul 08 '21 at 03:44
  • From the linked question though: `np.all([df['a'].str.contains(v) for v in ['app', 'web']], axis=0)` – Henry Ecker Jul 08 '21 at 03:46
  • Previous asking [pandas dataframe `str.contains()` AND operation](https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation). There are many duplicates on this in SO. – smci Jul 08 '21 at 04:20
  • `["app","web"]` is not an 'array', it's a list. Please use Python terminology not Javascript/JSON. – smci Jul 08 '21 at 04:22
  • There is no need to do `get_dummies()`. There are plenty of existing solutions on SO showing how to match multiple substrings, in any order, on a pandas string column. Please edit the title and body if you want to rephrase the question. – smci Jul 08 '21 at 04:28
  • @smci the link which you shared is more of a static solution; it requires AND-ing a fixed set of values, and that's why I was looking for a generalized solution – Shivam Sahil Jul 08 '21 at 04:56
  • @ShivamSahil: then you can post a better solution there. It was one of many duplicates on SO; you can also find another duplicate and cite it here. It is not good for SO to keep posting and re-asking the same questions. – smci Jul 08 '21 at 19:36
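
Pulling the comment suggestions together, a minimal sketch of the AND-style check built from per-value str.contains masks (my illustration, using the example frame from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['app;', 1, 2, 3], ['app; web;', 4, 5, 6], ['web;', 7, 8, 9], ['', 1, 4, 5]],
                  columns=['a', 'b', 'c', 'd'])
values = ['app', 'web']

# One boolean mask per value; np.all ANDs them element-wise across rows.
mask = np.all([df['a'].str.contains(v) for v in values], axis=0)
print(mask)  # [False  True False False]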

3 Answers


This should work too:

l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))

This should also work:

pd.concat([df['a'].str.contains(i) for i in l], axis=1).all(axis=1)
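
On the example frame, either expression flags only row 1, where both substrings are present (illustrative run):

>>> df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
0    False
1     True
2    False
3    False
Name: a, dtype: bool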
rhug123

"So many solutions, which one is the most efficient?"

The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:

[Plot: timings vs len(df)]

values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)  # '(?=.*app)(?=.*web)'

def replace_dummies_all(df):
    # strip spaces, one-hot encode on ';', then require every target column
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    # collect all matches per cell and check that every value was found
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    # case-insensitive variant: lowercase first, then match the lookahead pattern
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    # one boolean column per value, then AND across columns
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    # single regex with lookaheads: all values must appear somewhere in the cell
    return df.a.str.contains(pattern)
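
A minimal way to reproduce rough timings (my sketch; not necessarily the exact setup behind the plot above) is to tile the example frame and time each function:

from timeit import timeit

big = pd.concat([df] * 10_000, ignore_index=True)  # enlarge the 4-row example frame

for func in (replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains):
    t = timeit(lambda: func(big), number=10)
    print(f'{func.__name__:>22}: {t:.3f}s')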
tdy

Try with `str.get_dummies`:

df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0    False
1     True
2    False
3    False
dtype: bool
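
For reference, the intermediate dummy frame being indexed looks like this on the example data (illustrative run):

>>> df.a.str.replace(' ', '').str.get_dummies(';')
   app  web
0    1    0
1    1    1
2    0    1
3    0    0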

Update

df['a'].str.contains(r'^(?=.*web)(?=.*app)')
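0    False
1     True
2    False
3    False
Name: a, dtype: bool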

Update 2 (to ensure case-insensitivity and that the column dtype is str, without which the logic may fail):

elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'  # builds '(?=.*app)(?=.*web)'
header = 'a'  # name of the column to check
df[header] = df[header].astype(str).str.lower()  # ensure string dtype and case-insensitivity
result = df[header].str.contains(valueString)
Shivam Sahil
BENY
  • Why have you replaced the spaces? Additionally, what does `;` mean when you pass it to get_dummies? And `all(1)` means `all(axis=1)`, right? – Shivam Sahil Jul 08 '21 at 03:30
  • @ShivamSahil you have '; ' and ';' in items, all(1) is all(axis=1) – BENY Jul 08 '21 at 03:37
  • I see, but that `;` won't always be there; my end goal is to check the string present in the item and use an input array to check whether all of the inputs exist in that string. When the string looks like `app web`, your logic would fail, right? – Shivam Sahil Jul 08 '21 at 03:39
  • Thanks for the update, I have added a small snippet to make it more reliable for all cases. Thanks again. – Shivam Sahil Jul 08 '21 at 04:11