I have a dataframe called df
that looks similar to this (except the number of 'mat_deliv' columns goes up to mat_deliv_8, there are several hundred clients and a number of other columns between Client_ID
and mat_deliv_1
- I have simplified it here).
Client_ID mat_deliv_1 mat_deliv_2 mat_deliv_3 mat_deliv_4
C1019876 xxx,yyy,zzz aaa,bbb,xxx xxx ddd
C1018765 yyy,zzz xxx xxx None
C1017654 yyy,xxx aaa,bbb ccc ddd
C1016543 aaa,bbb ccc None None
C1019876 yyy None None None
I want to create a new column called xxx_count
which counts the number of times xxx
appears in mat_deliv_1
, mat_deliv_2
, mat_deliv_3
and mat_deliv_4
. The values should look like this:
Client_ID mat_deliv_1 mat_deliv_2 mat_deliv_3 mat_deliv_4 xxx_count
C1019876 xxx,yyy,zzz aaa,xxx,bbb xxx ddd 3
C1018765 yyy,zzz xxx xxx None 2
C1017654 yyy,xxx aaa,bbb ccc ddd 1
C1016543 aaa,bbb ccc None None 0
C1015432 yyy None None None 0
I have tried the following code:
df = df.assign(xxx_count=df.loc[:, "mat_deliv_1":"mat_deliv_4"].\
apply(lambda col: col.str.count('xxx')).fillna(0).astype(int))
But it does not produce a count, only a binary variable where 0
= no cases of xxx
and 1
= the presence of xxx
in at least one of the four mat_deliv
columns.
NB: this is a follow-up question to that asked here: Creating a column based on the presence of part of a string in multiple other columns