
Hello stackoverflowers!!

I have the following data frame

import numpy as np 
import pandas as pd

dataframe = pd.DataFrame({'column1': ['a', 'alike', 'b', 'x', 'a'],
                          'column2': ['a', 'b', 'c', 'unlake', 'like']})

  column1 column2
0       a       a
1   alike       b
2       b       c
3       x  unlake
4       a    like

And I would like to create another column that holds 1 if any of the following strings

check = ['like', 'lake', 'lik']

appears as a substring in either of the two columns.
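In other words, for the sample frame above the new column (called `match` here, the name is arbitrary) should come out as:

  column1 column2  match
0       a       a      0
1   alike       b      1
2       b       c      0
3       x  unlake      1
4       a    like      1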

I started with this:

any([check1 in dataframe['column1'][1] for check1 in check]) # for one value this works

However, when I try to apply it to the whole column, I do not get the expected result:

dataframe[['column1']].apply(lambda row: any((check1 in row for check1 in check)), axis = 1) # this runs but does not give the expected result
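(After some digging I suspect the reason: `in` on a pandas Series tests membership in the index, not substring containment in the values, so every row comes back False. A quick check on the same frame:)

row = dataframe.loc[1]      # Series with index ['column1', 'column2']
'like' in row               # False -- tests the index labels, not the values
'column1' in row            # True  -- 'column1' is an index label
'like' in row['column1']    # True  -- plain substring test on the string 'alike'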

I am probably missing something.

Thanks for the help

George Sotiropoulos

3 Answers


This should give you what you want (using `df` as a short name for your `dataframe`):

df['new_col'] = np.where(df.isin(check).any(axis=1), 1, 0)  # exact match only
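With your sample data only row 4 is an exact match ('like' in column2), so `new_col` comes out as:

#   column1 column2  new_col
# 0       a       a        0
# 1   alike       b        0
# 2       b       c        0
# 3       x  unlake        0
# 4       a    like        1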

Expanding this to check whether any cell contains one of the items as a substring:

import numpy as np 
import pandas as pd

df = pd.DataFrame({'column1': ['a', 'alike', 'b', 'x', 'a'],
                      'column2': ['a', 'b', 'c', 'unlake', 'like']})
check = ['like', 'lake', 'lik']
pattern = '|'.join(check)  # 'like|lake|lik' -- a regex alternation
mask = np.column_stack([df[col].str.contains(pattern, na=False) for col in df])
df['check'] = np.where(mask.any(axis=1), 1, 0)
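which, for the sample frame, gives:

#   column1 column2  check
# 0       a       a      0
# 1   alike       b      1
# 2       b       c      0
# 3       x  unlake      1
# 4       a    like      1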
usernamenotfound

Combine np.logical_or with pd.Series.str.contains. This assumes partial matches are permitted.

import numpy as np 
import pandas as pd

df = pd.DataFrame({'column1': ['a', 'alike', 'b', 'x', 'a'],
                   'column2': ['a', 'b', 'c', 'unlake', 'like']})

test_str = '|'.join({'like', 'lake', 'lik'})

df['Test'] = np.logical_or(df['column1'].str.contains(test_str),
                           df['column2'].str.contains(test_str))


# output
#   column1 column2   Test
# 0       a       a  False
# 1   alike       b   True
# 2       b       c  False
# 3       x  unlake   True
# 4       a    like   True
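If you want 1/0 rather than True/False, as in the question, cast the result:

df['Test'] = df['Test'].astype(int)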
jpp

You can join your substrings together using the regex "or" operator | and then apply the pd.Series.str.contains method like so:

dataframe['substr_match'] = dataframe.apply(
    lambda col: col.str.contains('|'.join(check))).any(axis=1)
#   column1 column2  substr_match
# 0       a       a         False
# 1   alike       b          True
# 2       b       c         False
# 3       x  unlake          True
# 4       a    like          True
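The intermediate `dataframe.apply(...)` call is itself a boolean frame, one column of substring hits per original column, which `.any(axis=1)` then collapses row-wise:

# dataframe.apply(lambda col: col.str.contains('|'.join(check)))
#   column1  column2
# 0   False    False
# 1    True    False
# 2   False    False
# 3   False     True
# 4   False     True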
cmaher
  • Thanks a lot! So, you first go over the columns with a vectorized operation, and then use any for each row. What if you could not use contains on the whole vector, only on one value? – George Sotiropoulos Jan 25 '18 at 09:30
  • @GeorgeSotiropoulos, I'm sure this method is sufficient for your use case. But worth noting `df.apply` is generally vectorised only in notation (see https://stackoverflow.com/a/38708239/9209546). – jpp Jan 25 '18 at 09:44
  • @GeorgeSotiropoulos `apply` still works on each value of a pandas Series individually, if that's what you mean (i.e. `contains` already works one value at a time, in this case). Try calling `dataframe.apply(lambda col: col.str.contains('|'.join(check)))` on its own if this is unclear. – cmaher Jan 25 '18 at 12:43
  • @cmaher Thanks for your answers, I am coming from R and DataTables so I still have to find my footing with pandas – George Sotiropoulos Jan 25 '18 at 15:03