I'm new to Python and am trying to one-hot encode a column. My code is below:
import pandas as pd
from operator import add
df = pd.DataFrame([
    [1895650, 2, float("nan"), "2018-07-27"],
    [1895650, 4, float("nan"), "2018-08-13"],
    [1896355, 2, float("nan"), "2018-08-10"],
    [1897675, 9, 12.0, "2018-08-13"],
    [1897843, 2, float("nan"), "2018-08-10"],
    [2178737, 3, 1.0, "2019-06-14"],
    [2178737, 4, 1.0, "2019-06-14"],
    [2178737, 7, 1.0, "2019-06-14"],
    [2178737, 1, 1.0, "2019-06-14"],
    [2178750, 6, 4.0, "2019-06-14"],
], columns=["Id", "ServiceSubCodeKey", "Aim", "PrintDate"])
def sum_l(values):
    out = []
    for element in values:
        out.append(element)
    return out
def sum_l2(values):
    if type(values[0]) != int:
        out = values[0]
        for i in range(1, len(values)):
            out = list(map(add, out, values[i]))
    else:
        out = values
    return out
columns = pd.get_dummies(df["ServiceSubCodeKey"]).astype(str)
df2 = columns[1]
for col in columns.columns[1::]:
    df2 += columns[col]
df3 = pd.concat([df, df2], axis=1)
df3[1] = df3[1].apply(lambda x: list(map(int, list(x))))
df4 = df3[["Id",1]].groupby("Id").agg(lambda x: sum_l(x)).reset_index()
df4[1] = df4[1].apply(lambda x: sum_l2(x))
df4[1] = df4[1].apply(lambda x: ''.join(list(map(str, list(x)))))
def f(x):
    while x[-1] == 0:
        x.pop()
    return x
df4[1] = df4[1].apply(lambda x: f(x))
df5 = pd.merge(df,df4, on="Id", how="left")
df5
Out[2]:
        Id  ServiceSubCodeKey   Aim   PrintDate        1
0  1895650                  2   NaN  2018-07-27  0101000
1  1895650                  4   NaN  2018-08-13  0101000
2  1896355                  2   NaN  2018-08-10  0100000
3  1897675                  9  12.0  2018-08-13  0000001
4  1897843                  2   NaN  2018-08-10  0100000
5  2178737                  3   1.0  2019-06-14  1011010
6  2178737                  4   1.0  2019-06-14  1011010
7  2178737                  7   1.0  2019-06-14  1011010
8  2178737                  1   1.0  2019-06-14  1011010
9  2178750                  6   4.0  2019-06-14  0000100
I am trying to one-hot encode the service subcodes (SSCs) associated with each Id. For example, Id 1895650 has two SSCs, 2 and 4, so its encoding should be 0101. But as the output shows, my code produces 0101000 instead, and I do not need the trailing zeros. Also, for Id 2178750 the encoding is 0000100. This is wrong; it should be 000001.
What is the reason for these errors?
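For comparison, here is a minimal sketch of the encoding described above. It assumes that character k of the string (counting from 1) stands for subcode k, that subcodes are positive integers, and that each Id's string should stop at its largest subcode. The `encode` helper and the `encoding` column name are made up for illustration. The sketch also lists the columns `pd.get_dummies` produces for this data: there is one column per subcode that actually occurs (1, 2, 3, 4, 6, 7, 9), so subcode 6 ends up in the fifth position rather than the sixth. Separately, `f` runs after the `''.join(...)` step, so `x` is a string and `x[-1] == 0` compares a character with the integer 0, which is never true. That would explain why the trailing zeros stay.

```python
import pandas as pd

# Same Id / ServiceSubCodeKey pairs as in the question (other columns omitted).
df = pd.DataFrame({
    "Id": [1895650, 1895650, 1896355, 1897675, 1897843,
           2178737, 2178737, 2178737, 2178737, 2178750],
    "ServiceSubCodeKey": [2, 4, 2, 9, 2, 3, 4, 7, 1, 6],
})

# Diagnostic: get_dummies only creates columns for subcodes present in the data,
# so 5 and 8 are missing and later positions shift to the left.
dummy_columns = list(pd.get_dummies(df["ServiceSubCodeKey"]).columns)
# dummy_columns == [1, 2, 3, 4, 6, 7, 9]


def encode(codes):
    """Build a 0/1 string where character k (1-based) is '1' if subcode k occurs.

    Hypothetical helper: assumes positive integer subcodes and ends the string
    at the largest subcode, so no trailing zeros are produced.
    """
    present = {int(c) for c in codes}
    return "".join("1" if k in present else "0"
                   for k in range(1, max(present) + 1))


# One string per Id, then mapped back onto every row of that Id.
encoding_by_id = df.groupby("Id")["ServiceSubCodeKey"].agg(encode)
df["encoding"] = df["Id"].map(encoding_by_id)
print(df)
```

With this data, Id 1895650 gets `0101` and Id 2178750 gets `000001`.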