I have a DataFrame in which each cell holds several values as a single delimited string (the dtype is string, not list). I have not been able to one-hot encode the values within each column.

Out:

        A                     B                             C
Ella    Red; Blue; Yellow     Circle; Square; Triangle      Small; Medium; Extra big
Mike    Yellow; Red; Blue     Oval; Triangle; Circle        Medium; Big; Extra big
Dave    Yellow; Red; Green    Circle; Square; Triangle      Extra small; Medium; Big

I would like the result to have two-level column headings, with the original column on top and each distinct value beneath it, like this:

       A                                 B                                     C
       Red    Blue   Green   Yellow      Circle   Triangle  Square   Oval      ....
Ella   1      1      0       1           1        1         1        0         ....
Mike   1      1      0       1           1        1         0        1         ....   
Dave   1      0      1       1           1        1         1        0         .... 

I tried the following, which helped, but it only works when all the columns contain the same set of values: https://stackoverflow.com/a/67110743/15646168

df = df.stack().str.get_dummies(sep=',')
df.columns = df.columns.str.strip()
df = df.stack().groupby(level=[0,1,2]).sum().unstack(level=[1,2])

Thank you so much!

1 Answer

Use concat with a dict comprehension and Series.str.get_dummies. The only change needed is the separator, which becomes '; ':

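import pandas as pd

# Encode each column separately; the dict keys become the top level of the column headings.
df = pd.concat({x: df[x].str.get_dummies(sep='; ') for x in df.columns}, axis=1)
print(df)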
        A                       B                        C            \
     Blue Green Red Yellow Circle Oval Square Triangle Big Extra big   
Ella    1     0   1      1      1    0      1        1   0         1   
Mike    1     0   1      1      1    1      0        1   1         1   
Dave    0     1   1      1      1    0      1        1   1         0   

                               
     Extra small Medium Small  
Ella           0      1     1  
Mike           0      1     0  
Dave           1      1     0  
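
As a minimal, self-contained sketch of the approach above: the frame below is retyped from the sample in the question, assuming the values in each cell are separated by exactly '; ' (semicolon plus one space). The name `encoded` is only illustrative.

```python
import pandas as pd

# Sample data retyped from the question; each cell is one "; "-separated string.
df = pd.DataFrame(
    {
        "A": ["Red; Blue; Yellow", "Yellow; Red; Blue", "Yellow; Red; Green"],
        "B": ["Circle; Square; Triangle", "Oval; Triangle; Circle", "Circle; Square; Triangle"],
        "C": ["Small; Medium; Extra big", "Medium; Big; Extra big", "Extra small; Medium; Big"],
    },
    index=["Ella", "Mike", "Dave"],
)

# Build one indicator frame per column. Passing a dict to concat with axis=1
# turns the dict keys into the top level of a two-level column index.
encoded = pd.concat(
    {col: df[col].str.get_dummies(sep="; ") for col in df.columns},
    axis=1,
)

# Select one original column to see only its indicator columns.
print(encoded["A"])
```

Because each column is encoded on its own, a value such as Green appears only under A, which is why the columns do not need to share the same set of values.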
jezrael
  • I am getting the following error - AttributeError: Can only use .str accessor with string values! – Anonymous Apr 16 '21 at 11:32
  • @Anonymous - So are there other columns that do not hold `;`-separated strings, unlike in the question? – jezrael Apr 16 '21 at 11:33
  • @Anonymous - Is it possible to filter the string columns with `df.select_dtypes('object')`, like `pd.concat({x: df[x].str.get_dummies(sep='; ') for x in df.select_dtypes('object').columns}, axis=1)` ? – jezrael Apr 16 '21 at 11:34
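
The filtering suggested in the last comment can be sketched as follows. The frame is hypothetical: the numeric `Age` column is an assumption standing in for whatever non-string column triggered the AttributeError, and the text column is created with an explicit object dtype so that `select_dtypes('object')` picks it up.

```python
import pandas as pd

# Hypothetical frame: "A" holds "; "-separated strings, "Age" is numeric.
df = pd.DataFrame(
    {
        "A": pd.Series(["Red; Blue", "Yellow; Red", "Green"], dtype=object),
        "Age": pd.Series([31, 45, 27]),
    }
)
df.index = ["Ella", "Mike", "Dave"]

# Calling df["Age"].str would raise AttributeError, so encode only the
# object-dtype columns and leave the numeric one out.
text_cols = df.select_dtypes("object").columns
encoded = pd.concat(
    {col: df[col].str.get_dummies(sep="; ") for col in text_cols},
    axis=1,
)

print(encoded)
```

If the numeric columns are still needed afterwards, they would have to be joined back separately; this sketch only covers the encoding step.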