Pandas get_dummies on multiple columns

Question

I have a dataset with multiple columns that I wish to one hot encode. However, I don't want to have the encoding for each one of them since said columns are related to the said items. What I want is one "set" of dummies variables that uses all the columns. See my code for a better explanation.

Suppose my dataframe looks like this:

In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

In [104]: dum
Out[104]:
 ch1 ch2 ch3
0   A   B   C
1   C   G   D
2   A   F   E

If I execute

pd.get_dummies(dum)

The output will be

   ch1_A  ch1_C  ch2_B  ch2_F  ch2_G  ch3_C  ch3_D  ch3_E
 0      1      0      1      0      0      1      0      0
 1      0      1      0      0      1      0      1      0
 2      1      0      0      1      0      0      0      1

However, what I would like to obtain is something like this:

 A B C D E F G
 1 1 1 0 0 0 0
 0 0 1 1 0 0 1
 1 0 0 0 1 1 0

Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group (A, B, and so on) with value 1 when any of the values in the columns ch1, ch2, ch3 show up.

To clarify, in my original dataset, a single row won't contain the same value (A,B,C...) more than once; it will just appear on one of the columns.

score 10 · Accepted Answer · answered Aug 26 '18 at 18:38

10

Using stack and str.get_dummies

dum.stack().str.get_dummies().sum(level=0)
Out[938]: 
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0

answered Aug 26 '18 at 18:38

BENY

317,841
20
164
234

1

I'll accept this as an answer. However, I'd like to comment (for those of you who are reading this), that this won't work so well if your dataset is big. In my case, the shape of my df was around (100000,120) and it crashed. Thanks for helping :) – user3276768 Sep 03 '18 at 15:05

unutbu · Answer 2 · 2018-08-26T23:38:51.953

You could use pd.crosstab to create a frequency table:

import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

stacked = dum.stack()
index = stacked.index.get_level_values(0)
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None

print(result)

yields

   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0

score 0 · Answer 3 · edited Aug 26 '18 at 18:00

0

Call it this way

x = pd.get_dummies(dum, prefix="", prefix_sep="")

And then print using

print(x.to_string(index=False))

edited Aug 26 '18 at 18:00

ayhan

70,170
20
182
203

answered Aug 26 '18 at 17:24

Mean Coder

304
1
12

The output shows repeated columns. It looks just like the output Raunaq (another answer) posted. – user3276768 Aug 26 '18 at 18:18

Raunaq Jain · Answer 4 · 2018-08-26T18:49:06.513

0

You can create dummies for separate columns and concat the results:

temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)

    A   C   B   F   G   C   D   E
0   1   0   1   0   0   1   0   0
1   0   1   0   0   1   0   1   0
2   1   0   0   1   0   0   0   1

temp.groupby(level=0, axis=1).sum()

    A   B   C   D   E   F   G
0   1   1   1   0   0   0   0
1   0   0   1   1   0   0   1
2   1   0   0   0   1   1   0

edited Aug 26 '18 at 18:49

answered Aug 26 '18 at 17:27

Raunaq Jain

917
7
13

In this case I'd have repeated columns (you have two C's). I know that I could join them and keep cleaning the dataset, but I was wondering if there's a straight and simpler way to do it. – user3276768 Aug 26 '18 at 17:41
Absolutely my fault. I didn't notice it at all. Since Wen has given a beautiful answer, I will be adding on to my own one. – Raunaq Jain Aug 26 '18 at 18:48

Pandas get_dummies on multiple columns

4 Answers4

Linked

Related