Populate values for categorical data in their respective one-hot encoded columns

Question

I have an csv file which have 100s of columns and rows. There two columns are my interest and based on that I need to create new columns in that csv file. Example: I have interested columns as below, input.csv

 count  description    
    1   Good        
    2   Medium          
    2   Best        
    1   Worst       
    2   Worst       
    1   Medium      
    9   Good

Output I expected, output.csv

 count  description    Good  Medium  Best  Worst
    1   Good            1     0       0      0
    2   Medium          0     2       0      0
    2   Best            0     0       2      0
    1   Worst           0     0       0      1
    2   Worst           0     0       0      1
    1   Medium          0     1       0      0
    9   Good            9     0       0      0

In the above csv table, based on the values of 'count' and 'description'. I am creating new columns Good,Medium,Best,Worst (from description values) and fill the values from count and put null to unavailable values of that row. I tried with compare the columns and map the value,but not able to put the value properly in the new columns.

cs95 · Accepted Answer · 2018-12-17T14:13:02.467

`str.get_dummies`

For performance and succinctness, use str.get_dummies with broadcasted numpy multiplication.

v = df['description'].str.get_dummies()
df.join(pd.DataFrame(
    v.values * df['count'].values[:, None], columns=v.columns))


   count description  Best  Good  Medium  Worst
0      1        Good     0     1       0      0
1      2      Medium     0     0       2      0
2      2        Best     2     0       0      0
3      1       Worst     0     0       0      1
4      2       Worst     0     0       0      2
5      1      Medium     0     0       1      0
6      9        Good     0     9       0      0

I now see this is a slightly different version of @nixon's answer, but hopefully the difference in how broadcasting is performed should excuse me :-)

`pivot` and `fillna`

Or, a modified version of OP's pivot solution:

df.join(df.reset_index()
          .pivot(index='index', columns='description', values='count')
          .fillna(0, downcast='infer'))

   count description  Best  Good  Medium  Worst
0      1        Good     0     1       0      0
1      2      Medium     0     0       2      0
2      2        Best     2     0       0      0
3      1       Worst     0     0       0      1
4      2       Worst     0     0       0      2
5      1      Medium     0     0       1      0
6      9        Good     0     9       0      0

score 2 · Answer 2 · answered Dec 17 '18 at 13:49

A way to do it is to create dummy columns using pd.get_dummies, and multiply them with df['count] to get the desired output:

d = pd.get_dummies(df.description) 
df.assign(**d.multiply(df['count'], axis=0))

    count description  Best  Good  Medium  Worst
0      1        Good     0     1       0      0
1      2      Medium     0     0       2      0
2      2        Best     2     0       0      0
3      1       Worst     0     0       0      1
4      2       Worst     0     0       0      2
5      1      Medium     0     0       1      0
6      9        Good     0     9       0      0

score 2 · Answer 3 · answered Dec 17 '18 at 13:52

You can pivot and add original frame:

f = df.pivot(columns='description', values='count').fillna(value=0, downcast='infer')
df[['count', 'description']].merge(f, left_index=True, right_index=True)

Which results in:

   count description  Best  Good  Medium  Worst
0      1        Good     0     1       0      0
1      2      Medium     0     0       2      0
2      2        Best     2     0       0      0
3      1       Worst     0     0       0      1
4      2       Worst     0     0       0      2
5      1      Medium     0     0       1      0
6      9        Good     0     9       0      0

score 1 · Answer 4 · answered Dec 17 '18 at 14:08

thank you for the above answers. I also tried and the solution as follows,

import pandas as pd
df = pd.read_csv('/input.csv')
res = df.pivot(index='index', columns='description', values='count')
res.to_csv('/out.csv',',',dtype='unicode8')
res1 = res.replace('NaN', '0', regex=True)
res1.to_csv('/out1.csv',',',dtype='unicode8')

Thanks, Sundar

Populate values for categorical data in their respective one-hot encoded columns

4 Answers4

str.get_dummies

pivot and fillna

`str.get_dummies`

`pivot` and `fillna`