Include more columns with get_dummies()

Question

I have the following lists:

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']

With the following code, I would like to one-hot encode list1 and list2 so that each result includes a column for every item in vocab, not only for the items that occur in the list.

import pandas as pd
encoding1 = pd.get_dummies(data=list1, columns=vocab)
encoding2 = pd.get_dummies(data=list2, columns=vocab)

I want the output:

encoding1 =      a   b   c   d   e   f   g   h   i   j
              1  1   0   0   0   0   0   0   0   0   0
              2  0   1   0   0   0   0   0   0   0   0 
              3  0   0   1   0   0   0   0   0   0   0
              4  0   0   0   1   0   0   0   0   0   0
              5  0   0   0   0   1   0   0   0   0   0

encoding2 =      a   b   c   d   e   f   g   h   i   j
              1  0   0   0   0   0   1   0   0   0   0
              2  0   0   0   0   0   0   1   0   0   0 
              3  0   0   0   0   0   0   0   1   0   0
              4  0   0   0   0   0   0   0   0   1   0
              5  0   0   0   0   0   0   0   0   0   1

However, I get the output:

encoding1 =      a   b   c   d   e   
              1  1   0   0   0   0  
              2  0   1   0   0   0  
              3  0   0   1   0   0  
              4  0   0   0   1   0   
              5  0   0   0   0   1   

encoding2 =      f   g   h   i   j   
              1  1   0   0   0   0  
              2  0   1   0   0   0  
              3  0   0   1   0   0  
              4  0   0   0   1   0   
              5  0   0   0   0   1

What can I do to get the desired output?

It is similar to the one asked here https://stackoverflow.com/questions/37425961/dummy-variables-when-not-all-categories-are-present — Nivi, May 28 '18 at 12:25

U13-Forward · Answer 1 · 2018-05-28T12:43:53.287

Try building a new DataFrame from each set of dummies and passing vocab as its columns. The columns that are missing from the dummies show up filled with NaN, so call the pandas fillna function with 0 to turn those NaNs into 0. Note that the added columns come out as floats (0.0):

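# One-hot encode each list; only the values that actually occur become columns
encoding1 = pd.get_dummies(data=list1)
encoding2 = pd.get_dummies(data=list2)
# Rebuild with the full vocab as columns; the missing ones are filled with NaN
df1 = pd.DataFrame(encoding1, columns=vocab)
df2 = pd.DataFrame(encoding2, columns=vocab)
# Replace the NaNs with 0
print(df1.fillna(0))
print(df2.fillna(0))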

Output:

df1
   a  b  c  d  e    f    g    h    i    j
0  1  0  0  0  0  0.0  0.0  0.0  0.0  0.0
1  0  1  0  0  0  0.0  0.0  0.0  0.0  0.0
2  0  0  1  0  0  0.0  0.0  0.0  0.0  0.0
3  0  0  0  1  0  0.0  0.0  0.0  0.0  0.0
4  0  0  0  0  1  0.0  0.0  0.0  0.0  0.0
df2
     a    b    c    d    e  f  g  h  i  j
0  0.0  0.0  0.0  0.0  0.0  1  0  0  0  0
1  0.0  0.0  0.0  0.0  0.0  0  1  0  0  0
2  0.0  0.0  0.0  0.0  0.0  0  0  1  0  0
3  0.0  0.0  0.0  0.0  0.0  0  0  0  1  0
4  0.0  0.0  0.0  0.0  0.0  0  0  0  0  1
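
A variation that skips the NaN step, and the mixed integer/float columns it leaves behind, is to reindex the dummy columns against vocab with a fill value. This is only a minimal sketch, assuming a pandas version in which get_dummies accepts the dtype argument; the helper name one_hot_with_vocab is made up for illustration and is not part of pandas.

```python
import pandas as pd

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']


def one_hot_with_vocab(items, vocab):
    """Hypothetical helper: one-hot encode items with one column per vocab entry."""
    # dtype=int keeps the dummies as 0/1 integers instead of booleans
    dummies = pd.get_dummies(items, dtype=int)
    # reindex adds the vocab columns that are absent and fills them with 0
    return dummies.reindex(columns=vocab, fill_value=0)


encoding1 = one_hot_with_vocab(list1, vocab)
encoding2 = one_hot_with_vocab(list2, vocab)
print(encoding1)
print(encoding2)
```

One thing to be aware of with this sketch: reindex keeps only the columns named in vocab, so an item that is not in vocab ends up as a row of all zeros rather than raising an error.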

score 0 · Answer 2 · answered May 28 '18 at 13:18

I would try:

vocab_dummies = pd.get_dummies(data=vocab)

# list1 is the first five vocab items and list2 the last five, so slice by position
encoding1 = vocab_dummies.iloc[:5, :]
encoding2 = vocab_dummies.iloc[5:, :].reset_index(drop=True)

encoding1
Out[67]: 
   a  b  c  d  e  f  g  h  i  j
0  1  0  0  0  0  0  0  0  0  0
1  0  1  0  0  0  0  0  0  0  0
2  0  0  1  0  0  0  0  0  0  0
3  0  0  0  1  0  0  0  0  0  0
4  0  0  0  0  1  0  0  0  0  0

encoding2
Out[68]: 
   a  b  c  d  e  f  g  h  i  j
0  0  0  0  0  0  1  0  0  0  0
1  0  0  0  0  0  0  1  0  0  0
2  0  0  0  0  0  0  0  1  0  0
3  0  0  0  0  0  0  0  0  1  0
4  0  0  0  0  0  0  0  0  0  1
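
The positional slices work here because list1 and list2 happen to be the first and second halves of vocab. For a list in arbitrary order, one option is to declare vocab as the categories of a pandas Categorical, so that get_dummies emits a column for every category whether or not it occurs. A sketch under that assumption, where shuffled is an illustrative input that does not appear in the question:

```python
import pandas as pd

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
shuffled = ['j', 'a', 'c']  # illustrative input, not from the question

# Declaring vocab as the categories tells pandas about values that never occur
cat = pd.Series(pd.Categorical(shuffled, categories=vocab))
# One column per category, in vocab order; dtype=int gives 0/1 instead of booleans
encoding = pd.get_dummies(cat, dtype=int)
print(encoding)
```

Each row follows the order of the input list, and the columns follow the order of vocab.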
