I have the following lists:

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']

With the following code, I would like to one-hot encode list1 (and likewise list2) so that the result has a column for every item in vocab, not only for the items that occur in the list.

import pandas as pd
encoding1 = pd.get_dummies(data= list1, columns= vocab)
encoding2 = pd.get_dummies(data= list2, columns= vocab)

I want the output:

encoding1 =      a   b   c   d   e   f   g   h   i   j
              1  1   0   0   0   0   0   0   0   0   0
              2  0   1   0   0   0   0   0   0   0   0 
              3  0   0   1   0   0   0   0   0   0   0
              4  0   0   0   1   0   0   0   0   0   0
              5  0   0   0   0   1   0   0   0   0   0

encoding2 =      a   b   c   d   e   f   g   h   i   j
              1  0   0   0   0   0   1   0   0   0   0
              2  0   0   0   0   0   0   1   0   0   0 
              3  0   0   0   0   0   0   0   1   0   0
              4  0   0   0   0   0   0   0   0   1   0
              5  0   0   0   0   0   0   0   0   0   1

However, I get the output:

encoding1 =      a   b   c   d   e   
              1  1   0   0   0   0  
              2  0   1   0   0   0  
              3  0   0   1   0   0  
              4  0   0   0   1   0   
              5  0   0   0   0   1   

encoding2 =      f   g   h   i   j   
              1  1   0   0   0   0  
              2  0   1   0   0   0  
              3  0   0   1   0   0  
              4  0   0   0   1   0   
              5  0   0   0   0   1  

What can I do to get the desired output?

ritsj
  • It is similar to the one asked here https://stackoverflow.com/questions/37425961/dummy-variables-when-not-all-categories-are-present – Nivi May 28 '18 at 12:25

2 Answers

Try building the dummies first, then wrapping each result in a new DataFrame with columns=vocab. The vocab columns that are missing from the dummies come out as NaN, so call the DataFrame's fillna(0) to turn those NaNs into 0:

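# One-hot encode each list; only the values present in the list get a column
encoding1 = pd.get_dummies(data=list1)
encoding2 = pd.get_dummies(data=list2)
# Rebuild with the full vocab as columns; columns missing from the dummies become NaN
df1 = pd.DataFrame(encoding1, columns=vocab)
df2 = pd.DataFrame(encoding2, columns=vocab)
# Replace the NaNs in the added columns with 0
print(df1.fillna(0))
print(df2.fillna(0))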

Output:

df1
   a  b  c  d  e    f    g    h    i    j
0  1  0  0  0  0  0.0  0.0  0.0  0.0  0.0
1  0  1  0  0  0  0.0  0.0  0.0  0.0  0.0
2  0  0  1  0  0  0.0  0.0  0.0  0.0  0.0
3  0  0  0  1  0  0.0  0.0  0.0  0.0  0.0
4  0  0  0  0  1  0.0  0.0  0.0  0.0  0.0
df2
     a    b    c    d    e  f  g  h  i  j
0  0.0  0.0  0.0  0.0  0.0  1  0  0  0  0
1  0.0  0.0  0.0  0.0  0.0  0  1  0  0  0
2  0.0  0.0  0.0  0.0  0.0  0  0  1  0  0
3  0.0  0.0  0.0  0.0  0.0  0  0  0  1  0
4  0.0  0.0  0.0  0.0  0.0  0  0  0  0  1
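
A possible variation on the approach above (not part of the original answer): the 0.0 values in the output appear because the added columns pass through NaN and so end up as floats. Reindexing with a fill value may avoid that round trip. This is only a sketch; the helper name one_hot_with_vocab is made up for illustration, and dtype=int is passed explicitly because, as far as I know, newer pandas versions return booleans from get_dummies by default.

```python
import pandas as pd

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']


def one_hot_with_vocab(items, vocab):
    """Hypothetical helper: one-hot encode items with one column per vocab entry."""
    # Encode only the values that occur in items, as integers rather than booleans
    dummies = pd.get_dummies(items, dtype=int)
    # Add the missing vocab columns filled with 0 and put all columns in vocab order
    return dummies.reindex(columns=vocab, fill_value=0)


encoding1 = one_hot_with_vocab(list1, vocab)
encoding2 = one_hot_with_vocab(list2, vocab)
print(encoding1)
print(encoding2)
```

One thing to keep in mind with this sketch: reindex keeps only the columns named in vocab, so a value in the list that is not in vocab would lose its column without any warning.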
U13-Forward

I would try one-hot encoding the whole vocab once and then slicing out the rows that correspond to each list:

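# One-hot encode the full vocab: row i is the encoding of vocab[i]
vocab_dummies = pd.get_dummies(data=vocab)

# list1 is vocab[0:5] and list2 is vocab[5:], so slice the rows by position
encoding1 = vocab_dummies.iloc[0:5, :]
encoding2 = vocab_dummies.iloc[5:, :].reset_index(drop=True)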

encoding1
Out[67]: 
   a  b  c  d  e  f  g  h  i  j
0  1  0  0  0  0  0  0  0  0  0
1  0  1  0  0  0  0  0  0  0  0
2  0  0  1  0  0  0  0  0  0  0
3  0  0  0  1  0  0  0  0  0  0
4  0  0  0  0  1  0  0  0  0  0

encoding2
Out[68]: 
   a  b  c  d  e  f  g  h  i  j
0  0  0  0  0  0  1  0  0  0  0
1  0  0  0  0  0  0  1  0  0  0
2  0  0  0  0  0  0  0  1  0  0
3  0  0  0  0  0  0  0  0  1  0
4  0  0  0  0  0  0  0  0  0  1
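
The slicing above works because list1 and list2 happen to be contiguous runs of vocab in the same order. For lists in arbitrary order, one option (along the lines of the question linked in the comment) is to tell pandas the full set of categories up front. The following is a rough sketch under that assumption; the function name encode and the sample list shuffled are invented for illustration and do not appear in the question.

```python
import pandas as pd

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = ['f', 'g', 'h', 'i', 'j']


def encode(items, vocab):
    """Hypothetical helper: one-hot encode items against a fixed category list."""
    # A categorical Series carries every vocab entry as a category,
    # including the entries that never occur in items
    as_categorical = pd.Series(pd.Categorical(items, categories=vocab))
    # get_dummies emits one column per category, in category order
    return pd.get_dummies(as_categorical, dtype=int)


encoding1 = encode(list1, vocab)
encoding2 = encode(list2, vocab)

# Order and repetition in the input no longer matter (made-up example list)
shuffled = ['j', 'a', 'c']
enc_shuffled = encode(shuffled, vocab)
print(enc_shuffled)
```

Compared with positional slicing, this does not depend on where the items sit in vocab, at the cost of building a categorical for every list.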
nimrodz