
I have a dataframe as shown below:

>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
           a  b  c  d
0       app;  1  2  3
1  app; web;  4  5  6
2       web;  7  8  9
3             1  4  5

I have an input array that looks like this: `["app","web"]`. For each of these values I want to check against a specific column of the dataframe and return a decision as shown below:

>>> df.a.str.contains("app")
0     True
1     True
2    False
3    False
Name: a, dtype: bool

Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to determine the same something like:

 df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'

My end goal is not an exact match (`df.a.isin(["app", "web"])`), but rather a 'contains' check that returns True when those substrings are present anywhere in that cell of the dataframe.
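To make the distinction concrete, an exact match finds nothing on this data, because every non-empty cell carries a trailing semicolon (illustrative run on the example frame):

>>> df.a.isin(["app", "web"])
0    False
1    False
2    False
3    False
Name: a, dtype: bool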

Note: I can of course use apply method to create my own function for the same logic such as:

elementsToLookFor = ["app","web"]
df[header] = df[header].apply(lambda element: all(a in element for a in elementsToLookFor))  # header is the column name, e.g. 'a'

But I am more interested in the most efficient approach, so I would prefer a native pandas function, or failing that, the next most optimized custom solution.

smci
Shivam Sahil
  • Similar to this question [Check if pandas column contains all elements from a list](https://stackoverflow.com/q/60932036/15497888) – Henry Ecker Jul 08 '21 at 03:33
  • I saw your previous comment `df.a.str.contains("|".join(["app","web"]))` which almost solves my case, except that I was looking for an `&` operation. I am not very good with regex and was wondering if there's a way to `&` it instead of `|` in your previous logic, something like: `df.a.str.contains("&".join(["app","web"]))` – Shivam Sahil Jul 08 '21 at 03:38
  • There isn't an "and" type operation in that context for regular expressions. – Henry Ecker Jul 08 '21 at 03:44
  • From the linked question though: `np.all([df['a'].str.contains(v) for v in ['app', 'web']], axis=0)` – Henry Ecker Jul 08 '21 at 03:46
  • Previous asking [pandas dataframe `str.contains()` AND operation](https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation). There are many duplicates on this in SO. – smci Jul 08 '21 at 04:20
  • `["app","web"]` is not an 'array', it's a list. Please use Python terminology not Javascript/JSON. – smci Jul 08 '21 at 04:22
  • There is no need to do `get_dummies()`. There are plenty of existing solutions on SO showing how to match multiple substrings, in any order, on a pandas string column. Please edit the title and body if you want to rephrase the question. – smci Jul 08 '21 at 04:28
  • @smci the link which you shared is more of a static solution; it requires AND-ing a fixed set of values, and that's why I was looking for a generalized solution – Shivam Sahil Jul 08 '21 at 04:56
  • @ShivamSahil: then you can post a better solution there. It was one of many duplicates on SO; you can also find another duplicate and cite it here. It is not good for SO to keep posting and re-asking the same questions. – smci Jul 08 '21 at 19:36
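
Pulling the comment suggestions together, a minimal sketch of the AND-style check built from per-value str.contains masks (my illustration, using the example frame from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['app;', 1, 2, 3], ['app; web;', 4, 5, 6], ['web;', 7, 8, 9], ['', 1, 4, 5]],
                  columns=['a', 'b', 'c', 'd'])
values = ['app', 'web']

# One boolean mask per value; np.all ANDs them element-wise across rows.
mask = np.all([df['a'].str.contains(v) for v in values], axis=0)
print(mask)  # [False  True False False]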

3 Answers


This should work too:

l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))

This should also work:

pd.concat([df['a'].str.contains(i) for i in l], axis=1).all(axis=1)
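
On the example frame, either expression flags only row 1, where both substrings are present (illustrative run):

>>> df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
0    False
1     True
2    False
3    False
Name: a, dtype: bool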
rhug123

"So many solutions, which one is the most efficient?"

The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:

[Plot: timings vs len(df)]

values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)  # '(?=.*app)(?=.*web)'

def replace_dummies_all(df):
    # strip spaces, one-hot encode on ';', then require every target column
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    # collect all matches per cell and check that every value was found
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    # case-insensitive variant: lowercase first, then match the lookahead pattern
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    # one boolean column per value, then AND across columns
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    # single regex with lookaheads: all values must appear somewhere in the cell
    return df.a.str.contains(pattern)
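
A minimal way to reproduce rough timings (my sketch; not necessarily the exact setup behind the plot above) is to tile the example frame and time each function:

from timeit import timeit

big = pd.concat([df] * 10_000, ignore_index=True)  # enlarge the 4-row example frame

for func in (replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains):
    t = timeit(lambda: func(big), number=10)
    print(f'{func.__name__:>22}: {t:.3f}s')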
tdy

Try with `str.get_dummies`:

df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0    False
1     True
2    False
3    False
dtype: bool
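
For reference, the intermediate dummy frame being indexed looks like this on the example data (illustrative run):

>>> df.a.str.replace(' ', '').str.get_dummies(';')
   app  web
0    1    0
1    1    1
2    0    1
3    0    0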

Update

df['a'].str.contains(r'^(?=.*web)(?=.*app)')
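0    False
1     True
2    False
3    False
Name: a, dtype: bool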

Update 2 (to ensure case-insensitivity and that the column dtype is str, without which the logic may fail):

elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'  # builds '(?=.*app)(?=.*web)'
header = 'a'  # name of the column to check
df[header] = df[header].astype(str).str.lower()  # ensure string dtype and case-insensitivity
result = df[header].str.contains(valueString)
Shivam Sahil
BENY
  • Why have you replaced the spaces? Additionally, what does `;` mean when you pass it to get_dummies? And `all(1)` means `all(axis=1)`, right? – Shivam Sahil Jul 08 '21 at 03:30
  • @ShivamSahil you have '; ' and ';' in items, all(1) is all(axis=1) – BENY Jul 08 '21 at 03:37
  • I see, but that `;` won't always be there; my end goal is to check the string present in the item and use an input array to check whether all of the inputs exist in that string. When the string looks like `app web`, your logic would fail, right? – Shivam Sahil Jul 08 '21 at 03:39
  • Thanks for the update, I have added a small snippet to make it more reliable for all cases. Thanks again. – Shivam Sahil Jul 08 '21 at 04:11