1

I have a dataframe which looks like the below one. It has the ID column and the products history of each customer.

ID      1    2    3    4
1       A    B    C    D
2       E    C    B    D
3       F    B    C    D

Instead of listing products for each customer, I would like to convert the products to features(columns) so that the data frame will look like this.

ID      A    B    C    D    E    F
1       1    1    1    1    0    0
2       0    0    0    1    1    0
3       0    1    1    1    0    1

I tried using get_dummies function, however, this will render different columns as 1-A, 1-E, 1-F, 2-B, 2-C, ....etc which is not what I need.

Any advice in getting this done.

Natheer Alabsi
  • 2,790
  • 4
  • 19
  • 28
  • 1
    You may need one hot encoding. See [How can I one hot encode in Python?](http://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python). – Neo X Feb 13 '17 at 05:07

2 Answers2

2

This would yield a dataframe you want.

df = pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int)

print (df)

Output:

    A  B  C  D  E  F
ID                  
1   1  1  1  1  0  0
2   0  1  1  1  1  0
3   0  1  1  1  0  1
Alex Fung
  • 1,996
  • 13
  • 21
2

You can use get_dummies and aggregate with max:

print (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='')
         .groupby(axis=1, level=0).max())

Or:

print (pd.get_dummies(df.set_index('ID').stack())
         .groupby(level=0).max().astype(int))

You can use custom function, but it is slow:

df = df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1)
       .fillna(0)
       .astype(int)
print (df)
    A  B  C  D  E  F
ID                  
1   1  1  1  1  0  0
2   0  1  1  1  1  0
3   0  1  1  1  0  1

I was interesting about timings:

np.random.seed(123)
N = 10000
L = list('ABCDEFGHIJKLMNOPQRST')
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.choice(L, size=(N,5)))
df = df.rename_axis('ID').reset_index()
print (df.head())

#Alex Fung solution
In [160]: %timeit (pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int))
10 loops, best of 3: 27.9 ms per loop

In [161]: %timeit (pd.get_dummies(df.set_index('ID').stack()).groupby(level=0).max().astype(int))
10 loops, best of 3: 26.3 ms per loop

In [162]: %timeit (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='').groupby(axis=1, level=0).max())
10 loops, best of 3: 26.4 ms per loop

In [163]: %timeit (df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int))
1 loop, best of 3: 3.95 s per loop
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • `IndentationError: unexpected indent` is raised in Python 2.7. – Alex Fung Feb 13 '17 at 07:09
  • Yes, then need only one row like `df = df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int)` or add `\\` to end of first and second row – jezrael Feb 13 '17 at 07:10
  • You are right! Just a bit confusing as the code bit should run out of the box. Cool way to approach the problem ! Cheers. – Alex Fung Feb 13 '17 at 07:13