
I know this can easily be done with the pandas package, but my data is too sparse and large (170,000 x 5000), and in the end I need to feed it to sklearn again, so I'm wondering whether there is a way to do it with sklearn directly. I tried OneHotEncoder, but got stuck associating the dummies with the 'id'.

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3], 'item': ['a', 'a', 'c', 'b', 'a', 'b']})

   id item
0   1    a
1   1    a
2   2    c
3   2    b
4   3    a
5   3    b

dummy = pd.get_dummies(df, prefix='item', columns=['item'])
dummy.groupby('id').sum().reset_index()

   id  item_a  item_b  item_c
0   1       2       0       0
1   2       0       1       1
2   3       1       1       0

Update:

Now I'm here, but the 'id' is lost. How do I do the aggregation then?

import numpy as np
import sklearn.preprocessing

lab = sklearn.preprocessing.LabelEncoder()
labels = lab.fit_transform(np.array(df.item))
enc = sklearn.preprocessing.OneHotEncoder()
dummy = enc.fit_transform(labels.reshape(-1, 1))

dummy.todense()

matrix([[ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  0.,  1.],
        [ 0.,  1.,  0.],
        [ 1.,  0.,  0.],
        [ 0.,  1.,  0.]])
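The lost grouping can be recovered without densifying anything: build a sparse indicator matrix over the ids and multiply it with the one-hot matrix, so each group's rows are summed. A minimal sketch, assuming scipy is acceptable alongside sklearn:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'item': ['a', 'a', 'c', 'b', 'a', 'b']})

# One-hot encode the items (sparse), as in the question.
labels = LabelEncoder().fit_transform(df['item'])       # [0 0 2 1 0 1]
dummy = OneHotEncoder().fit_transform(labels.reshape(-1, 1))  # 6x3 sparse

# Recover the 'id' grouping: G[g, r] = 1 if row r belongs to group g,
# so G @ dummy sums the one-hot rows per group.
group_codes = LabelEncoder().fit_transform(df['id'])    # [0 0 1 1 2 2]
n_groups = group_codes.max() + 1
G = csr_matrix((np.ones(len(df)), (group_codes, np.arange(len(df)))),
               shape=(n_groups, len(df)))

agg = G @ dummy   # sparse, n_groups x n_items
print(agg.toarray())
# [[2. 0. 0.]
#  [0. 1. 1.]
#  [1. 1. 0.]]
```

This reproduces the `dummy.groupby('id').sum()` result from above while keeping every intermediate matrix sparse.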
  • Can you do the categories in a dataframe as you show and then use as_matrix() method to convert to numpy array representation? – Steve Misuta Dec 22 '15 at 03:59
  • @SteveMisuta Yes I can do that. Can you please elaborate the reason? – user2165 Dec 22 '15 at 06:41
  • @Chen Did you ever get this figured out? – Afflatus May 02 '16 at 13:12
  • 1
    @Afflatus, I think I turned to scipy sparse matrix at the end. First use `df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})`, which takes some time but not days. Then use pivot table, which can be found [here](http://stackoverflow.com/a/31679396/4015990). It met my need at that time. Hope it helps, any comment, let me know. – user2165 May 03 '16 at 19:32
  • THANK YOU!!! That was extremely helpful. – Afflatus May 04 '16 at 20:30

2 Answers


In case anyone needs a reference in the future, here is my solution, using a scipy sparse matrix.

First, group the records by ('id', 'item') and count them:

df = df.groupby(['id','item']).size().reset_index().rename(columns={0:'count'})

This takes some time but not days.
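To make the intermediate result concrete, here is what the grouped frame looks like on the toy data from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'item': ['a', 'a', 'c', 'b', 'a', 'b']})
grouped = (df.groupby(['id', 'item']).size()
             .reset_index().rename(columns={0: 'count'}))
print(grouped)
#    id item  count
# 0   1    a      2
# 1   2    b      1
# 2   2    c      1
# 3   3    a      1
# 4   3    b      1
```

Each (id, item) pair appears once, with its multiplicity in 'count'; this is the long-format input the pivot step below expects.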

Then build the pivot table as a sparse matrix, using a solution I found here.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id, item, count):
    id_u = list(df[id].unique())
    item_u = list(np.sort(df[item].unique()))
    data = df[count].tolist()
    # Map each id/item to a row/column index via fixed category lists.
    # (The old df[col].astype('category', categories=...) signature was
    # removed in later pandas; pd.Categorical is the equivalent.)
    row = pd.Categorical(df[id], categories=id_u).codes
    col = pd.Categorical(df[item], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))

Then call the function

result = to_sparse_pivot(df, 'id', 'item', 'count')
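End to end, the two steps reproduce the pandas pivot from the question. A self-contained check (the function is restated here with `pd.Categorical`, since the `astype('category', categories=...)` signature was removed in newer pandas):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_pivot(df, id_col, item_col, count_col):
    id_u = list(df[id_col].unique())
    item_u = list(np.sort(df[item_col].unique()))
    data = df[count_col].tolist()
    row = pd.Categorical(df[id_col], categories=id_u).codes
    col = pd.Categorical(df[item_col], categories=item_u).codes
    return csr_matrix((data, (row, col)), shape=(len(id_u), len(item_u)))

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'item': ['a', 'a', 'c', 'b', 'a', 'b']})
counts = (df.groupby(['id', 'item']).size()
            .reset_index().rename(columns={0: 'count'}))
result = to_sparse_pivot(counts, 'id', 'item', 'count')
print(result.toarray())
# [[2 0 0]
#  [0 1 1]
#  [1 1 0]]
```

The rows follow the order of first appearance of each id and the columns are the sorted items, matching the `get_dummies` table at the top of the question.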

OneHotEncoder requires integers, so here is one way to map your items to unique integers. Because the mapping is one-to-one, the dictionary can also be reversed.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3], 
                   'Item': ['a', 'a', 'c', 'b', 'a', 'b']})

mapping = {letter: integer for integer, letter in enumerate(df.Item.unique())}
reverse_mapping = {integer: letter for letter, integer in mapping.items()}

>>> mapping
{'a': 0, 'b': 2, 'c': 1}

>>> reverse_mapping
{0: 'a', 1: 'c', 2: 'b'}

Now create a OneHotEncoder and map your values.

hot = OneHotEncoder()
h = hot.fit_transform(df.Item.map(mapping).values.reshape(len(df), 1))
>>> h
<6x3 sparse matrix of type '<type 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> h.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])

And for reference, these would be the appropriate columns:

>>> [reverse_mapping[n] for n in reverse_mapping.keys()]
['a', 'c', 'b']

From your data, you can see that the value c appeared in the third row of the dataframe (index 2). The reverse mapping shows that c corresponds to the middle column, and row 2 is the only row of the matrix with a one in the middle column, confirming the result.

Beyond this, I'm not sure where you'd be stuck. If you still have issues, please clarify.

To concatenate the ID values:

>>> np.concatenate((df.ID.values.reshape(len(df), 1), h.toarray()), axis=1)
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])

To keep the array sparse:

from scipy.sparse import hstack, lil_matrix

id_vals = lil_matrix(df.ID.values.reshape(len(df), 1))
h_sparse = hstack([id_vals, h.tolil()])
>>> type(h_sparse)
scipy.sparse.coo.coo_matrix

>>> h_sparse.toarray()
array([[ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 2.,  0.,  1.,  0.],
       [ 2.,  0.,  0.,  1.],
       [ 3.,  1.,  0.,  0.],
       [ 3.,  0.,  0.,  1.]])
  • Thank you for your answer, I'm stuck on the aggregation according to those 'id's. Please see the update in the original question for the detail. – user2165 Dec 22 '15 at 06:42
  • What is your desired output? – Alexander Dec 22 '15 at 06:46
  • Something like the one I showed in the original question after the `dummy.groupby('id').sum().reset_index()`. It has to be a sparse matrix, otherwise it takes too much memory. BTW, I tried the pandas way with the real dataset, and the `groupby` step takes days, so I gave up before it could finish. – user2165 Dec 22 '15 at 06:52