11

I have a dataset with multiple columns that I wish to one hot encode. However, I don't want to have the encoding for each one of them since said columns are related to the said items. What I want is one "set" of dummies variables that uses all the columns. See my code for a better explanation.

Suppose my dataframe looks like this:

In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

In [104]: dum
Out[104]:
 ch1 ch2 ch3
0   A   B   C
1   C   G   D
2   A   F   E

If I execute

pd.get_dummies(dum)

The output will be

   ch1_A  ch1_C  ch2_B  ch2_F  ch2_G  ch3_C  ch3_D  ch3_E
 0      1      0      1      0      0      1      0      0
 1      0      1      0      0      1      0      1      0
 2      1      0      0      1      0      0      0      1

However, what I would like to obtain is something like this:

 A B C D E F G
 1 1 1 0 0 0 0
 0 0 1 1 0 0 1
 1 0 0 0 1 1 0

Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group (A, B, and so on) with value 1 when any of the values in the columns ch1, ch2, ch3 show up.

To clarify, in my original dataset, a single row won't contain the same value (A,B,C...) more than once; it will just appear on one of the columns.

user3276768
  • 1,416
  • 3
  • 18
  • 28

4 Answers4

10

Using stack and str.get_dummies

dum.stack().str.get_dummies().sum(level=0)
Out[938]: 
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    I'll accept this as an answer. However, I'd like to comment (for those of you who are reading this), that this won't work so well if your dataset is big. In my case, the shape of my df was around (100000,120) and it crashed. Thanks for helping :) – user3276768 Sep 03 '18 at 15:05
6

You could use pd.crosstab to create a frequency table:

import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

stacked = dum.stack()
index = stacked.index.get_level_values(0)
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None

print(result)

yields

   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
0

Call it this way

x = pd.get_dummies(dum, prefix="", prefix_sep="")

And then print using

print(x.to_string(index=False))
ayhan
  • 70,170
  • 20
  • 182
  • 203
Mean Coder
  • 304
  • 1
  • 12
0

You can create dummies for separate columns and concat the results:

temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)

    A   C   B   F   G   C   D   E
0   1   0   1   0   0   1   0   0
1   0   1   0   0   1   0   1   0
2   1   0   0   1   0   0   0   1

temp.groupby(level=0, axis=1).sum()

    A   B   C   D   E   F   G
0   1   1   1   0   0   0   0
1   0   0   1   1   0   0   1
2   1   0   0   0   1   1   0
Raunaq Jain
  • 917
  • 7
  • 13
  • In this case I'd have repeated columns (you have two C's). I know that I could join them and keep cleaning the dataset, but I was wondering if there's a straight and simpler way to do it. – user3276768 Aug 26 '18 at 17:41
  • Absolutely my fault. I didn't notice it at all. Since Wen has given a beautiful answer, I will be adding on to my own one. – Raunaq Jain Aug 26 '18 at 18:48