I have the following df:

id  step1  step2  step3  step4  ...  stepn-1  stepn  event
 1      a      b      c   null  ...     null   null      1
 2      b      d      f   null  ...     null   null      0
 3      a      d      g      h  ...        l      m      1

Here, id is a session, the steps represent a certain path, and event indicates whether something specific happened.

I want to create a feature store where we take all the possible steps (a, b, c, ... up to some arbitrary number) and make them the columns. The id should remain as a column, and each step column should contain a 1 or 0 depending on whether that session hit that step. The desired result is below:

id  a  b  c  d  e  f  g ... n event
 1  1  1  1  0  0  0  0     0   1
 2  0  1  0  0  0  1  0     0   0
 3  1  0  0  1  0  0  1     1   1

I have a unique list of all the possible steps, which I assume will be used to construct the new table, but beyond that I am struggling to work out how to create it.
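For reference, the example frame above could be built with something like this (a sketch; the nulls are assumed to be NaN and the steps are truncated to step1–step4):

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the example data, truncated to four step columns
df = pd.DataFrame({
    'id': [1, 2, 3],
    'step1': ['a', 'b', 'a'],
    'step2': ['b', 'd', 'd'],
    'step3': ['c', 'f', 'g'],
    'step4': [np.nan, np.nan, 'h'],
    'event': [1, 0, 1],
})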

S44
  • This might help [`One Hot Encoding in Pandas`](https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python). – Mayank Porwal Nov 16 '21 at 18:56

2 Answers


What you are looking for is often used in machine learning, and is called one-hot encoding.

There is a pandas function designed specifically for this purpose: `pd.get_dummies()`.

import pandas as pd

# Split the step columns from the rest (id, event)
step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]

# Stack the steps into one Series, one-hot encode, then collapse back to one row per session
new_df = pd.get_dummies(df[step_cols].stack()).groupby(level=0).max()
new_df[other_cols] = df[other_cols]

Output:

>>> new_df
   a  b  c  d  f  g  h  l  m  id  event
0  1  1  1  0  0  0  0  0  0   1      1
1  0  1  0  1  1  0  0  0  0   2      0
2  1  0  0  1  0  1  1  1  1   3      1
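If you also want id first, event last, and a column for every step in your master list (including steps that never appear in the data), a small follow-up could look like this (a sketch; `all_steps` is a hypothetical name for the unique list of possible steps mentioned in the question):

# Hypothetical: all_steps is the asker's unique list of every possible step
all_steps = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l', 'm']

# Add all-zero columns for unseen steps, then put id first and event last
new_df = new_df.reindex(columns=['id'] + all_steps + ['event'], fill_value=0)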
  • I went in that direction too at first. But I got the double columns (here `b`). There's probably an easy way to consolidate them? – Timus Nov 16 '21 at 19:34
  • Oh yeah, wow, can't believe I didn't see that. There is an easy way to consolidate them. –  Nov 16 '21 at 20:23
  • You should probably get dummies on the stack `df[step_cols].stack()` – Quang Hoang Nov 17 '21 at 16:53
  • No, I mean `get_dummies` on the stacked data, not the data itself. `get_dummies(df)` gets the dummies for **each column** separately, then concatenates (joins) them horizontally. So something like `pd.get_dummies(df[step_cols].stack()).max(level=0)` would work without the extra step from your other question. – Quang Hoang Nov 17 '21 at 17:06
  • I am reviewing the solution soon :) thanks. – S44 Nov 18 '21 at 18:41
  • @user17242583 are you familiar with this error? I got it on line 4 of your solution: Unable to allocate 6.06 GiB for an array with shape (1760410, 3699) and data type uint8 – S44 Nov 18 '21 at 18:47

Probably not the most elegant way:

import pandas as pd

step_cols = [col for col in df.columns if col.startswith("step")]
# All distinct step values that occur anywhere in the step columns
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
# One row per session: flag which of the possible values appear in that session's steps
df1 = pd.DataFrame(
    (values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
    columns=values
).astype(int)
# Re-attach id and event around the indicator columns
df = pd.concat([df.id, df1, df.event], axis=1)

Result for

df =
   id step1 step2 step3 step4  event
0   1     a     b     c   NaN      1
1   2     b     d     f   NaN      0
2   3     a     d     g     h      1

is

   id  a  b  c  d  f  g  h  event
0   1  1  1  1  0  0  0  0      1
1   2  0  1  0  1  1  0  0      0
2   3  1  0  0  1  0  1  1      1
Timus