Check if any column in a subset of columns contains any string in a list of strings pandas row-wise?

Question

I am looking for a way to check, row by row, whether any column in a subset of DataFrame columns contains any string from a list of strings. Is there a better way to do this than using apply?

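import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

def check_data(row, cols, strings):
    # With axis=1, apply passes one row at a time.
    # Return 1 if any of the selected columns holds one of the strings.
    for j in cols:
        if row[j] in strings:
            return 1
    return 0

df['answer'] = df.apply(check_data, cols=['col1'], strings=['dog', 'cat'], axis=1)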

Output:

    col1   col2  answer
    cat    car        1
    dog    bike       1
    mouse             0

df['answer'] = df.apply(check_data, cols=['col1', 'col2'], strings=['bike', 'mouse'], axis=1)

Output2:

    col1   col2  answer
    cat    car        0
    dog    bike       1
    mouse             1

This gives the desired output, but I want to know if there is a better, more Pythonic way to do this without applying the function to each row of the data. Thanks!

You're only checking one column here. What result would you like when checking all columns? – Naveed, Sep 28 '22 at 19:39
Posted a solution. It's not clear what your expected outcome would be if you were to check multiple columns for the string. – Naveed, Sep 28 '22 at 19:47

score 1 · Answer 1 · answered Sep 28 '22 at 19:26

I had to add a few columns so as not to use the apply function.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
df['col3'] = df.values.tolist()  # each row's values as a list
df['strings'] = [['dog', 'cat'] for _ in df.index]  # the list of strings, repeated per row
df['common'] = [list(set(a).intersection(b)) for a, b in zip(df['col3'], df['strings'])]  # common elements
df['answer'] = np.where(df['common'].str.len() > 0, 1, 0)  # 1 if anything matched, else 0
df = df.drop(columns=['col3', 'strings', 'common'])  # drop the helper columns

I guess this code can be cleaned further.
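
One possible cleanup, as a rough sketch rather than a definitive version: the same set-intersection idea without the temporary columns. The helper name any_match is made up for illustration, and unlike the code above it checks only the columns passed in cols rather than every column.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

def any_match(frame, cols, strings):
    """Hypothetical helper: 1 for rows where any of `cols` holds a value in `strings`, else 0."""
    wanted = set(strings)
    # zip(*columns) walks the selected columns row by row as tuples
    rows = zip(*(frame[c] for c in cols))
    # isdisjoint is True when the row shares no value with the wanted strings
    return [int(not wanted.isdisjoint(row)) for row in rows]

df['answer'] = any_match(df, ['col1', 'col2'], ['bike', 'mouse'])
print(df)
```

This still loops over the rows in Python, so it mainly tidies the code; it is not vectorised.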

score 1 · Answer 2 · answered Sep 28 '22 at 19:42

Your question mentions a list of columns, but the expected result is for only one column.

Would you have a separate answer column for each column when evaluating multiple columns?

In case you need to check only one column, here is one way to do it without apply:

strings = ['dog', 'cat']
df['answer'] = df['col1'].isin(strings).astype(int)
df

    col1    col2    answer
0   cat     car     1
1   dog     bike    1
2   mouse           0
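
If a separate answer column per checked column is what's wanted, a minimal sketch could look like the following. The _answer suffix and the names flags and result are assumptions chosen for illustration, not anything from the question.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
cols = ['col1', 'col2']
strings = ['bike', 'mouse']

# One 0/1 flag per checked column, named col1_answer, col2_answer, ...
flags = df[cols].isin(strings).astype(int).add_suffix('_answer')
result = df.join(flags)

# A single combined flag: 1 if any of the checked columns matched
result['answer'] = flags.any(axis=1).astype(int)
print(result)
```

The per-column flags show which column matched, and the combined answer column collapses them into the single row-wise result the question asks for.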

score 0 · Answer 3 · Markus Kiesel · 2022-09-29T13:32:19.123

If you do not want to use apply I would suggest the following.

  df['answer'] = df[['col1']].isin(['dog', 'cat']).any(axis=1)

But if your concern is performance, I'm not sure whether isin() performs better than apply(). Have a look at the link below; maybe you can implement it with vectorised calculations.

Performance of Pandas apply vs np.vectorize to create new column from existing columns
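
To check the performance question on your own data, a rough timing sketch might look like this. The frame size (the three example rows repeated 1000 times), the repeat count and the helper names with_apply and with_isin are all arbitrary assumptions; the timings will differ between machines and pandas versions, so no numbers are claimed here.

```python
import timeit

import pandas as pd

cols = ['col1', 'col2']
strings = ['bike', 'mouse']

# Repeat the three example rows to get a frame large enough to time
small = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                      'col2': ['car', 'bike', '']})
big = pd.concat([small] * 1000, ignore_index=True)

def with_apply(frame):
    # Row-wise Python function, as in the question
    return frame.apply(lambda row: int(any(row[c] in strings for c in cols)), axis=1)

def with_isin(frame):
    # Column-wise membership test, then "any" across each row
    return frame[cols].isin(strings).any(axis=1).astype(int)

# Both approaches should agree before their speed is compared
assert with_apply(big).tolist() == with_isin(big).tolist()

t_apply = timeit.timeit(lambda: with_apply(big), number=3)
t_isin = timeit.timeit(lambda: with_isin(big), number=3)
print(f"apply: {t_apply:.4f}s  isin: {t_isin:.4f}s")
```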

edited Sep 29 '22 at 13:32

answered Sep 28 '22 at 17:49


Thanks very much for the suggestions! As mentioned in the question, I am looking for a way to do this without using apply... – Sjoseph Sep 28 '22 at 17:55
@Sjoseph I updated the answer for solving your problem without apply. – Markus Kiesel Sep 29 '22 at 13:34
Thanks for your answer and details. This snippet does not run without errors for me. Also, I don't think this logic works for any column in a subset of columns. – Sjoseph Sep 29 '22 at 15:02
You can add multiple columns inside the brackets. df[['col1', 'col2']].isin(... – Markus Kiesel Sep 29 '22 at 15:43
