How to convert row string values to features

Question

I have a dataframe which looks like the below one. It has the ID column and the products history of each customer.

ID      1    2    3    4
1       A    B    C    D
2       E    C    B    D
3       F    B    C    D

Instead of listing products for each customer, I would like to convert the products to features(columns) so that the data frame will look like this.

ID      A    B    C    D    E    F
1       1    1    1    1    0    0
2       0    0    0    1    1    0
3       0    1    1    1    0    1

I tried using get_dummies function, however, this will render different columns as 1-A, 1-E, 1-F, 2-B, 2-C, ....etc which is not what I need.

Any advice in getting this done.

You may need one hot encoding. See [How can I one hot encode in Python?](http://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python). — Neo X, Feb 13 '17 at 05:07

Alex Fung · Answer 1 · 2017-02-13T07:03:54.783

2

This would yield a dataframe you want.

df = pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int)

print (df)

Output:

    A  B  C  D  E  F
ID                  
1   1  1  1  1  0  0
2   0  1  1  1  1  0
3   0  1  1  1  0  1

edited Feb 13 '17 at 07:03

answered Feb 13 '17 at 06:11

Alex Fung

1,996
13
21

jezrael · Answer 2 · 2017-02-13T08:33:37.117

You can use get_dummies and aggregate with max:

print (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='')
         .groupby(axis=1, level=0).max())

Or:

print (pd.get_dummies(df.set_index('ID').stack())
         .groupby(level=0).max().astype(int))

You can use custom function, but it is slow:

df = df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1)
       .fillna(0)
       .astype(int)
print (df)
    A  B  C  D  E  F
ID                  
1   1  1  1  1  0  0
2   0  1  1  1  1  0
3   0  1  1  1  0  1

I was interesting about timings:

np.random.seed(123)
N = 10000
L = list('ABCDEFGHIJKLMNOPQRST')
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.choice(L, size=(N,5)))
df = df.rename_axis('ID').reset_index()
print (df.head())

#Alex Fung solution
In [160]: %timeit (pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int))
10 loops, best of 3: 27.9 ms per loop

In [161]: %timeit (pd.get_dummies(df.set_index('ID').stack()).groupby(level=0).max().astype(int))
10 loops, best of 3: 26.3 ms per loop

In [162]: %timeit (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='').groupby(axis=1, level=0).max())
10 loops, best of 3: 26.4 ms per loop

In [163]: %timeit (df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int))
1 loop, best of 3: 3.95 s per loop

`IndentationError: unexpected indent` is raised in Python 2.7. — Alex Fung, Feb 13 '17 at 07:09
Yes, then need only one row like `df = df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int)` or add `\\` to end of first and second row — jezrael, Feb 13 '17 at 07:10
You are right! Just a bit confusing as the code bit should run out of the box. Cool way to approach the problem ! Cheers. — Alex Fung, Feb 13 '17 at 07:13

How to convert row string values to features

2 Answers2