1

im new to python and trying to one hot encode. My code is below:

import pandas as pd
from operator import add
df = pd.DataFrame([
[1895650,2,float("nan"),  "2018-07-27"],
[1895650,4,float("nan"),  "2018-08-13"],
[1896355,2,float("nan"),  "2018-08-10"],
[1897675,9,12.0 ,"2018-08-13"],
[1897843,2,float("nan"),"2018-08-10"],
[2178737,3,1.0,  "2019-06-14"],
[2178737,4,1.0, "2019-06-14"],
[2178737,7,1.0 , "2019-06-14"],
[2178737,1,1.0 , "2019-06-14"],
[2178750,6,4.0 , "2019-06-14"]],columns = ["Id","ServiceSubCodeKey","Aim","PrintDate"])


def sum_l(values):
    out = []
    for element in values:
        out.append(element)
    return out

def sum_l2(values):
    
    if type(values[0]) != int:
        out = values[0]
        for i in range(1,len(values)):
            out = list(map(add, out, values[i]))
    else:
        out = values
    return out


columns = pd.get_dummies(df["ServiceSubCodeKey"]).astype(str)
df2 = columns[1]
for col in columns.columns[1::]:
    df2 += columns[col]
  
df3 = pd.concat([df, df2], axis=1)
df3[1] = df3[1].apply(lambda x: list(map(int, list(x))))
    
df4 = df3[["Id",1]].groupby("Id").agg(lambda x: sum_l(x)).reset_index()
df4[1] = df4[1].apply(lambda x: sum_l2(x))
df4[1] = df4[1].apply(lambda x: ''.join(list(map(str, list(x)))))
def f(x):
    while x[-1] == 0:
        x.pop()
    return x

df4[1] = df4[1].apply(lambda x: f(x))
df5 = pd.merge(df,df4, on="Id", how="left")
df5

Out[2]: 
        Id  ServiceSubCodeKey   Aim   PrintDate        1
0  1895650                  2   NaN  2018-07-27  0101000
1  1895650                  4   NaN  2018-08-13  0101000
2  1896355                  2   NaN  2018-08-10  0100000
3  1897675                  9  12.0  2018-08-13  0000001
4  1897843                  2   NaN  2018-08-10  0100000
5  2178737                  3   1.0  2019-06-14  1011010
6  2178737                  4   1.0  2019-06-14  1011010
7  2178737                  7   1.0  2019-06-14  1011010
8  2178737                  1   1.0  2019-06-14  1011010
9  2178750                  6   4.0  2019-06-14  0000100

I am trying to one hot encode the service subcodes(ssc) associated to each ID. where lets say id 1895650 has two ssc's 2,4 then the encoding should be 0101. But as you see in my code the output shows as 0101000 for some reason. I do not need the additional 0's. Also, for id 2178750, the encoding is 0000100.This is wrong, It should be 000001.

What is the reason for these errors?

3 Answers3

0

The problem is with the function f(x)

def f(x):
    while x[-1] == 0:
        x.pop()
    return x

df4[1] contains strings at this point (df4.loc[0, 1] == '0101000'), so you need to modify the function f(x) to trim strings instead of lists. You can replace the code above with

def f(x):
    while x[-1] == "0":
        x = x[:-1]
    return x

But what is the reason id 2178750 still gives the wrong encoding of 00001 inatead of 000001 where the ssc is 6

Note: this explains why this happens given your method of one-hot encoding. Please see my suggested encoding method below, which solves this error.

The values for ssc are not continuous (i.e., 5 and 8 are missing). These one-hot encoding, then, encodes whether each unique value is in the ssc for an Id. Below, consider the one-hot encodings for Id 1895650 and 2178750. This should demonstrate why the one-hot encoding for 2178750 is 00001 instead of 000001.

| 1 | 2 | 3 | 4 | 6 | 8 | 9 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |

You can also one-hot encode your data with the code below (with inspiration from https://stackoverflow.com/a/37323404/5666087).

import numpy as np
import pandas as pd

df = pd.DataFrame([
  [1895650, 2, float("nan"), "2018-07-27"],
  [1895650, 4, float("nan"), "2018-08-13"],
  [1896355, 2, float("nan"), "2018-08-10"],
  [1897675, 9, 12.0, "2018-08-13"],
  [1897843, 2, float("nan"), "2018-08-10"],
  [2178737, 3, 1.0, "2019-06-14"],
  [2178737, 4, 1.0, "2019-06-14"],
  [2178737, 7, 1.0, "2019-06-14"],
  [2178737, 1, 1.0, "2019-06-14"],
  [2178750, 6, 4.0, "2019-06-14"]
], columns = ["Id", "ServiceSubCodeKey", "Aim", "PrintDate"])


n_values = df["ServiceSubCodeKey"].max() + 1
onehot = np.eye(n_values, dtype=int)[df["ServiceSubCodeKey"]]
onehot = pd.DataFrame(data=onehot, index=df.loc[:, "Id"])
# Remove first column because we start at 1 instead of 0.
onehot = onehot.iloc[:, 1:]
onehot = onehot.groupby(onehot.index).sum().astype(str).sum(1, numeric_only=False)
onehot = onehot.str.rstrip("0")
onehot

The onehot dataframe is

Id
1895650         0101
1896355           01
1897675    000000001
1897843           01
2178737      1011001
2178750       000001
dtype: object
jkr
  • 17,119
  • 2
  • 42
  • 68
  • But what is the reason id 2178750 still gives the wrong encoding of 00001 inatead of 000001 where the ssc is 6 –  Jul 27 '20 at 16:17
  • That comes from `pd.get_dummies(df["ServiceSubCodeKey"])`. There are no ssc == 5, so ssc == 6 will be 00001. – jkr Jul 27 '20 at 16:22
  • The numpy method gave an error AttributeError: Can only use .str accessor with string values! –  Jul 27 '20 at 16:53
  • Are you running it exactly as I have written it? If it still doesn't work for you, you can replace the last line with `onehot.astype(str).str.rstrip("0")`. – jkr Jul 27 '20 at 16:59
0

Not related to why it fails, but here is a simpler solution:

def dummy(arr):
    out_arr = np.zeros(np.max(arr))
    out_arr[arr - 1] = 1
    return ''.join(str(x) for x in out_arr.astype(int))

df.groupby('Id')['ServiceSubCodeKey'].apply(lambda x: dummy(np.sort(np.array(x))))

Here we group ServiceSubCodeKey by Id and get an array of ServiceSubCodeKey values per Id, which we then process with a custom dummy function to get desired output.

0

I believe the reason why the code for 2178750 is missing zeros is because not all digits are present in scc (5 and 8 are missing, so the code for 1897675 with only 9 also has a couple of leading zeros missing).

Here's a suggestion for a different enconder:

columns = (
    pd.get_dummies(df["ServiceSubCodeKey"])
    .reindex(range(df.ServiceSubCodeKey.min(),
        df.ServiceSubCodeKey.max()+1), axis=1, fill_value=0)
    # now it has all digits
    .astype(str)
    )
codes = pd.Series(
    [int(''.join(row)) for row in columns.itertuples(index=False)],
    index=df.index)
codes = (
    codes.groupby(df.Id).transform('sum').astype('str')
    .str.pad(width=columns.shape[1], fillchar='0')
    .str.rstrip('0') # this will remove trailing 0's, but I strongly recommend you keep them
    )

print(codes)

Output

0         0101
1         0101
2           01
3    000000001
4           01
5      1011001
6      1011001
7      1011001
8      1011001
9       000001
dtype: object
RichieV
  • 5,103
  • 2
  • 11
  • 24
  • #EmbraceTheForce of Pandas – RichieV Jul 27 '20 at 16:38
  • Thank you Richie. This was exactly what i was looking for. irrespective of a ssc being in the df the encoding should reflect the number. so thanks. i marked it correct. Just one more question, since im new to py, im not certain this how can I append this, so my final output has this as well as my original DF columns –  Jul 28 '20 at 04:53
  • use `df = df.assign(one_hot_ssc=codes)` ... Pandas has a great user guide, worth reading in depth https://pandas.pydata.org/docs/getting_started/index.html – RichieV Jul 28 '20 at 04:59
  • Hi Richie, I have encountered n issue OverflowError: int too large to convert to float when I have tried this on a larger df –  Jul 28 '20 at 16:16
  • Then skip `codes = pd.Series(...` and do `codes = columns.groupby(df.Id).transform('sum').astype(str).agg(''.join, axis=1)` – RichieV Jul 28 '20 at 17:28
  • How will you use the joined str encoding? Wouldn't you rather keep it as a dataframe?... by removing `.astype().agg()` – RichieV Jul 28 '20 at 17:33
  • I tried the above but it is giving me a wrong encoding unfortunately –  Jul 28 '20 at 17:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/218766/discussion-between-jay-janardhan-and-richiev). –  Jul 28 '20 at 17:43