0

I am looking for a way to check whether any column in a subset of DataFrame columns contains any string from a list of strings. Is there a better way to do this than using apply?

import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

def check_data(row, cols, strings):
    # With axis=1, apply passes one row at a time.
    for j in cols:
        if row[j] in strings:
            return 1
    return 0

df['answer'] = df.apply(check_data, cols=['col1'], strings=['dog', 'cat'], axis=1)

Output:

col1  col2  answer
cat   car       1
dog  bike       1
mouse           0

df['answer'] = df.apply(check_data, cols=['col1', 'col2'], strings=['bike', 'mouse'], axis=1)
Output2:

    col1  col2  answer
    cat   car       0
    dog  bike       1
    mouse           1

This gives the desired output, but is there a better, more Pythonic way to do this without applying the function to each row of the data? Thanks!

Sjoseph
  • You're only checking one column here. How would you like the result to look when checking all columns? – Naveed Sep 28 '22 at 19:39
  • Posted a solution. It's not clear what your expected outcome would be when checking multiple columns for the strings. – Naveed Sep 28 '22 at 19:47

3 Answers

1

I had to add a few helper columns to avoid using the apply function.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})
df['col3'] = df.values.tolist() # creating new column as lists 
df['strings'] = [['dog', 'cat'] for i in df.index]  # creating new column with list of strings 
df['common']  =  [list(set(a).intersection(set(b))) for a, b in zip(df['col3'], df['strings'])] # getting common elements 
df['answer'] = np.where(df['common'].str.len()>0,1,0) 
df.drop(['col3','strings','common'],axis=1,inplace=True) #dropping created cols

I guess this code can be cleaned further.
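
One possible cleanup of the set-intersection idea above, as a rough sketch rather than the answerer's own code: it skips the temporary columns and looks only at a chosen subset of columns. The names cols and wanted are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

cols = ['col1', 'col2']   # subset of columns to inspect (assumed for this sketch)
wanted = {'dog', 'cat'}   # strings to look for, stored as a set

# One set intersection per row; a non-empty intersection means a match.
df['answer'] = [int(bool(wanted.intersection(row)))
                for row in df[cols].to_numpy().tolist()]
```

This still loops over the rows in Python, so it avoids apply but is not vectorised.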

1

Your question mentions a list of columns, but the expected result covers only one column.

Would you want a separate answer column for each column when evaluating multiple columns?

If you only need to check one column, here is one way to do it without apply:

strings = ['dog', 'cat']
df['answer'] = df['col1'].isin(strings).astype(int)
df
    col1    col2    answer
0   cat     car     1
1   dog     bike    1
2   mouse           0
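
If a separate answer column per checked column is what is wanted, the same isin call can be applied to several columns at once. This is only a sketch of that reading of the question; the '_answer' suffix and the names cols, flags and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

cols = ['col1', 'col2']        # columns to check (assumed)
strings = ['bike', 'mouse']    # strings to look for

# DataFrame.isin tests every cell; the result keeps one column per input column.
flags = df[cols].isin(strings).astype(int).add_suffix('_answer')
result = df.join(flags)
```

Here result has the extra columns col1_answer and col2_answer, each holding 1 where that column's value is in strings.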
Naveed
0

If you do not want to use apply, I would suggest the following.

  df['answer'] = df[['col1']].isin(['dog', 'cat']).any(axis=1)

If your concern is performance, I'm not sure whether isin() performs better than apply(). Have a look at the link below; maybe you can implement it with vectorised calculations.

Performance of Pandas apply vs np.vectorize to create new column from existing columns
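
One way to settle the performance question is to time both approaches on a larger frame. The sketch below is a rough comparison under assumed data (the question's three rows repeated 5,000 times); the helper names with_apply and with_isin are hypothetical, and the timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

def check_data(row, cols, strings):
    # Row-wise check, equivalent to the function in the question.
    return int(any(row[c] in strings for c in cols))

# Assumed test data: the question's rows repeated to get 15,000 rows.
big = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'] * 5_000,
                    'col2': ['car', 'bike', ''] * 5_000})
cols, strings = ['col1', 'col2'], ['bike', 'mouse']

def with_apply():
    return big.apply(check_data, cols=cols, strings=strings, axis=1)

def with_isin():
    return big[cols].isin(strings).any(axis=1).astype(int)

# Both approaches should give the same flags before timing means anything.
same = with_apply().tolist() == with_isin().tolist()

print('same result:', same)
print('apply:', timeit.timeit(with_apply, number=3))
print('isin: ', timeit.timeit(with_isin, number=3))
```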

  • Thanks very much for the suggestions! As mentioned in the question, I am looking for a way to do this without using apply... – Sjoseph Sep 28 '22 at 17:55
  • @Sjoseph I updated the answer for solving your problem without apply. – Markus Kiesel Sep 29 '22 at 13:34
  • Thanks for your answer and details. This snippet does not run without errors for me. Also, I don't think this logic works for any column in a subset of columns. – Sjoseph Sep 29 '22 at 15:02
  • You can add multiple columns inside the brackets. df[['col1', 'col2']].isin(... – Markus Kiesel Sep 29 '22 at 15:43
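
Putting the last comment into practice, here is a minimal sketch that wraps the isin approach in a function taking any subset of columns. The function name flag_rows is an assumption for illustration; astype(int) is added so the result is 1/0 as in the question rather than True/False.

```python
import pandas as pd

def flag_rows(df, cols, strings):
    """Return 1 for rows where any of `cols` equals one of `strings`, else 0."""
    return df[cols].isin(strings).any(axis=1).astype(int)

df = pd.DataFrame({'col1': ['cat', 'dog', 'mouse'],
                   'col2': ['car', 'bike', '']})

# The two calls from the question, without apply.
answer1 = flag_rows(df, ['col1'], ['dog', 'cat'])
answer2 = flag_rows(df, ['col1', 'col2'], ['bike', 'mouse'])
```

answer1 and answer2 match the answer columns in the question's two outputs. Note that isin tests for exact equality, like the `in` check in the original function; it does not do substring matching.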