1

I have a DataFrame with a list of variables within each column. I cannot figure out how to One-Hot Encode the data within each column.

In:

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
    
df = pd.DataFrame(lst, columns =['A', 'B', 'C'], dtype = float)
Out:

        A                     B                        C
Ella    Red, Blue, Yellow     Blue, Green, Yellow      Green, Red, Blue
Mike    Yellow, Red, Blue     Blue, Red, Green         Yellow, Blue, Red
Dave    Yellow, Red, Green    Red, Yellow, Blue        Green, Blue, Red

I am looking to create it with multi-tiered column headings to look like the below:

       A                                 B                               C
       Red    Blue   Green   Yellow      Red    Blue   Green   Yellow    ....
Ella   1      1      0       1           0      1      1       1         ....
Mike   1      1      0       1           1      1      1       0         ....   
Dave   1      0      1       1           1      1      0       1         ....                                                                                                                                                     

I would be so grateful for some guidance as I've been stuck on this for a while!

2 Answers2

1

There is a very good answer here. In your case you have to apply the same to different columns, so something like (It can be further optimized):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

lst = [['Red, Blue, Yellow', 'Blue, Green, Yellow', 'Green, Red, Blue'], ['Yellow, Red, Blue', 'Blue, Red, Green', 'Yellow, Blue, Red'], ['Yellow, Red, Green', 'Red, Yellow, Blue', 'Green, Blue, Red']]
    
df = pd.DataFrame(lst, columns =['A', 'B', 'C'], dtype = float)

mlb = {}
res = {}
for column in df.columns:
    mlb[column] = MultiLabelBinarizer()

    res[column] = pd.DataFrame(mlb[column].fit_transform(df[column].apply(lambda x: [j.strip() for j in x.split(",")])),
                       columns=mlb[column].classes_,
                       index=df[column].index)

arrays = [np.concatenate(([np.array([column]*len(mlb[column].classes_)) for column in df.columns])),
          np.concatenate(([mlb[column].classes_ for column in df.columns]))]
df_end = pd.DataFrame(columns = arrays, index = [0,1,2])

for column in df.columns:
    df_end[column] = res[column]

df_end


    A                       B                     C
    Blue Green Red  Yellow  Blue Green Red Yellow Blue Green Red Yellow
0   1    0     1    1       1    1     0   1      1    1     1   0
1   1    0     1    1       1    1     1   0      1    0     1   1
2   0    1     1    1       1    0     1   1      1    1     1   0
Let's try
  • 1,044
  • 9
  • 20
1

Here is a way:

df.stack().str.get_dummies(sep = ', ').unstack().swaplevel(axis=1).sort_index(level=0,axis=1)

or

df = df.stack().str.get_dummies(sep=',')
df.columns = df.columns.str.strip()
df = df.stack().groupby(level=[0,1,2]).sum().unstack(level=[1,2])
rhug123
  • 7,893
  • 1
  • 9
  • 24
  • I have a follow up question - if each category had different variable, e.g. A - Colour, B - Shape, C - Size, how could we make this solution work? – Anonymous Apr 16 '21 at 07:38
  • the top level of the columns should be able to be named whatever you would like without a problem – rhug123 Apr 16 '21 at 13:04