How to convert binary columns with multiple occurrences into categorical data in Pandas

Question

I have the following example data set

A	B	C	D
foo	0	1	1
bar	0	0	1
baz	1	1	0

How could extract the column names of each 1 occurrence in a row and put that into another column E so that I get the following table:

A	B	C	D	E
foo	0	1	1	C, D
bar	0	0	1	D
baz	1	1	0	B, C

Note that there can be more than two 1s per row.

wavingtide · Accepted Answer · 2022-12-29T02:34:07.703

2

You can use DataFrame.dot.

df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df

     A  B  C  D     E
0  foo  0  1  1  C, D
1  bar  0  0  1     D
2  baz  1  1  0  B, C

Inspired by jezrael's answer in this post.

Another way is that you can convert each row to boolean and use it as a selection mask to filter the column names.

cols = pd.Index(['B', 'C', 'D'])

df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df

     A  B  C  D     E
0  foo  0  1  1  C, D
1  bar  0  0  1     D
2  baz  1  1  0  B, C

edited Dec 29 '22 at 02:34

answered Nov 13 '22 at 01:08

wavingtide

1,032
4
19

This answer is neat and can be applied also if there are other columns in between A and B. Marking as accepted. Could you also please explain what the dot does in this case? – Dan G Nov 13 '22 at 01:15
The dot product does matrix multiplication. Based on my understanding, `df.dot` use `np.dot`, which use the Python native implementation of `dot`. In this case, dot product between an integer and column index is like how multiplying string and integer in python works (eg: `'A'*3 + 'B'*2` returns `'AAABB'`). It only works for string in `pd.Index` type, but not list and numpy array. I am not sure why. – wavingtide Nov 13 '22 at 01:56

How to convert binary columns with multiple occurrences into categorical data in Pandas

1 Answers1