Creating a column based on the presence of part of a string in multiple other columns

Question

I have a dataframe called df that looks similar to this (except the number of 'mat_deliv' columns goes up to mat_deliv_8 and there are several hundred clients - I have simplified it here).

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  ccc          ddd
C1018765   yyy,zzz      xxx          bbb          None
C1017654   yyy,xxx      aaa,bbb      ccc          ddd
C1016543   aaa,bbb      ccc          None         None
C1019876   yyy          None         None         None

I want to create a new column called xxx_deliv with two values, 0 and 1. I want to set xxx_deliv to equal 1 if any one of mat_deliv_1, mat_deliv_2, mat_deliv_3 or mat_deliv_4 contains xxx, and 0 if they do not.

So, I want to add a column that, with the example df, would look like this:

Client_ID  mat_deliv_1  mat_deliv_2  mat_deliv_3  mat_deliv_4  xxx_deliv
C1019876   xxx,yyy,zzz  aaa,bbb,xxx  ccc          ddd          1
C1018765   yyy,zzz      xxx          bbb          None         1
C1017654   yyy,xxx      aaa,bbb      ccc          ddd          1
C1016543   aaa,bbb      ccc          None         None         0
C1019876   yyy          None         None         None         0

I know that the following code performs the desired task:

df['xxx_deliv'] = 0
df.loc[(df.Mat_deliv_1.str.contains("xxx", na=False)) |
       (df.Mat_deliv_2.str.contains("xxx", na=False)) |
       (df.Mat_deliv_3.str.contains("xxx", na=False)) |
       (df.Mat_deliv_4.str.contains("xxx", na=False)),
       'xxx_deliv'] = 1

But I would like to be able to do this without going through each individual column - I need to be able to search through multiple columns at once.

Alexander · Accepted Answer · 2019-01-07T21:35:25.730

3

You need to check each column individually. You can do this via apply, checking that the string contains the target text. Then apply any to the row (by specifying axis=1). Convert the boolean result to an integer via .astype(int), and then used assign to add it as a new column to the dataframe.

I used loc[:, target_cols] to specify my search range as all rows in the dataframe and all of the chosen target_cols.

target_cols = ['mat_deliv_1', 'mat_deliv_2', 'mat_deliv_3', 'mat_deliv_4']
df = (df
      .assign(xxx_deliv=df.loc[:, target_cols].apply(lambda col: col.str.contains('xxx'))
      .any(axis=1)
      .astype(int))
>>> df
  Client_ID  mat_deliv_1  mat_deliv_2 mat_deliv_3 mat_deliv_4  xxx_deliv
0  C1019876  xxx,yyy,zzz  aaa,bbb,xxx         ccc         ddd          1
1  C1018765      yyy,zzz          xxx         bbb        None          1
2  C1017654      yyy,xxx      aaa,bbb         ccc         ddd          1
3  C1016543      aaa,bbb          ccc        None        None          0
4  C1019876          yyy         None        None        None          0

edited Jan 07 '19 at 21:35

answered Jan 07 '19 at 17:52

Alexander

105,104
32
201
196

That works perfectly, thanks Alexander. I made one edit, replacing iloc with loc, as I have a number of columns between `Client_ID` and `mat_deliv_1`, and this number is likely to change; fixing by label instead of position means I won't have to change the script in future: `df = df.assign(xxx_deliv=df.loc[:, "mat_deliv_1":"mat_deliv_4"].apply(lambda col: col.str.contains('xxx')).any(axis=1).astype(int))` I'm going to have to learn more about `lambda`, I've been seeing it everywhere! – FGreen Jan 07 '19 at 21:30
1

@FGreen See edited answer above. I've added a list container to hold the target columns where you want to check the contents. – Alexander Jan 07 '19 at 21:36
Thanks @Alexander - quick follow up question: How would I adapt it this if I wanted the `xxx_deliv` column to count the number of times `xxx` occurred in the four `mat_deliv` columns (e.g. `2` in row 0, `1` in row 1) instead of just a binary 0/1 value for absence/presence? – FGreen Jan 08 '19 at 16:01
@FGreen You should ask follow-on questions as a new question (providing a link to the original question to provide context). In this case, you can try `..apply(lambda col: col.str.count('xxx')).sum(axis=1)` – Alexander Jan 08 '19 at 22:50
New question asked here: [https://stackoverflow.com/questions/54098875/count-occurrences-of-a-string-in-multiple-string-columns] Thanks Alexander. – FGreen Jan 08 '19 at 23:05

Dani Mesejo · Answer 2 · 2019-01-07T18:01:08.177

You could use apply:

def contains(xs, pat='xxx'):
    return int(any(pat in x for x in xs.values))


df['xxx_deliv'] = df[['mat_deliv_1', 'mat_deliv_2', 'mat_deliv_3', 'mat_deliv_4']].apply(contains, axis=1)
print(df)

Output

  Client_ID  mat_deliv_1    ...    mat_deliv_4 xxx_deliv
0  C1019876  xxx,yyy,zzz    ...            ddd         1
1  C1018765      yyy,zzz    ...           None         1
2  C1017654      yyy,xxx    ...            ddd         1
3  C1016543      aaa,bbb    ...           None         0
4  C1019876          yyy    ...           None         0

[5 rows x 6 columns]

Creating a column based on the presence of part of a string in multiple other columns

2 Answers2

Linked