construct sparse matrix using categorical data

Question

I have data that looks something like this:

numpy array:

[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]

It's like a user-item matrix. I want to construct a sparse matrix with shape (number_of_items, number_of_users) that holds 1 if the user has rated/bought an item and 0 otherwise. So, for the above example, the shape should be (5, 6). This is just an example; there are thousands of users and thousands of items.

Currently I'm doing this using two for loops. Is there a faster, more Pythonic way of achieving the same?

desired output:

1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1

where rows: abc,def,ghi,fg,f76 and columns: a,b,c,d,e,f

Could you explicitly give your desired output? – Daniel Lenz, Aug 14 '15 at 11:17
Added in the question. – Abhishek Thakur, Aug 14 '15 at 11:24

score 3 · Accepted Answer · 2015-08-15T12:42:19.287

The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:

import numpy as np
from scipy import sparse

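# user_item is the (n, 2) array of [user, item] pairs from the question
users, I = np.unique(user_item[:,0], return_inverse=True)
items, J = np.unique(user_item[:,1], return_inverse=True)

points = np.ones(len(user_item), int)
# coo_matrix takes a (data, (row, col)) tuple; rows are items, columns are users
mat = sparse.coo_matrix((points, (J, I)), shape=(len(items), len(users)))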
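
For reference, here is a self-contained sketch of the same idea run on the sample data from the question. The names user_item, user_idx and item_idx are my own, and the step that resets repeated pairs to 1 is an assumption about what you want when the same user/item pair appears more than once.

```python
import numpy as np
from scipy import sparse

# Sample (user, item) pairs from the question; `user_item` is an assumed name.
user_item = np.array([
    ['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
    ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76'],
])

# np.unique returns the sorted labels and, for each input row, the index of its label.
users, user_idx = np.unique(user_item[:, 0], return_inverse=True)
items, item_idx = np.unique(user_item[:, 1], return_inverse=True)

# One stored entry per pair; rows are items, columns are users.
data = np.ones(len(user_item), dtype=int)
mat = sparse.coo_matrix(
    (data, (item_idx, user_idx)), shape=(len(items), len(users))
).tocsr()

# Converting to CSR sums repeated pairs, so reset every stored value to 1
# (assumption: only "rated or not" matters, not how many times).
mat.data[:] = 1

dense = mat.toarray()
print(items)  # row labels, in sorted order
print(dense)
```

Note that the rows come out in sorted label order (abc, def, f76, fg, ghi), not in the order shown in the question. Only call toarray() on small examples; for thousands of users and items, keep working with the sparse matrix.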

score 2 · Answer 2 · answered Apr 17 '18 at 03:46

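pandas.get_dummies provides an easy way to convert a categorical column into indicator (0/1) columns. The result is dense by default; passing sparse=True makes it use sparse columns instead.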

import pandas as pd
#construct the data
x = pd.DataFrame([['a', 'abc'],['b', 'def'],['c', 'ghi'],
                 ['d', 'abc'],['a', 'ghi'],['e', 'fg'],
                 ['f', 'f76'],['b', 'f76']], 
                 columns = ['user','item'])
print(x)
#    user  item
# 0     a   abc
# 1     b   def
# 2     c   ghi
# 3     d   abc
# 4     a   ghi
# 5     e    fg
# 6     f   f76
# 7     b   f76
# add one indicator column per distinct item (dtype=int gives 0/1 rather than booleans)
x = x.join(pd.get_dummies(x['item'], prefix='item', dtype=int))
print(x)
#    user  item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0     a   abc         1         0         0        0         0
# 1     b   def         0         1         0        0         0
# 2     c   ghi         0         0         0        0         1
# 3     d   abc         1         0         0        0         0
# 4     a   ghi         0         0         0        0         1
# 5     e    fg         0         0         0        1         0
# 6     f   f76         0         0         1        0         0
# 7     b   f76         0         0         1        0         0
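
The frame above has one row per record, not one row per item. A possible way to get from there to the item-by-user layout asked for in the question is to build the indicator columns from the user column and collapse them per item. This is only a sketch: the names user_dummies and item_user are mine, and the result is a dense DataFrame, so it is meant for small data.

```python
import pandas as pd

x = pd.DataFrame(
    [['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
     ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76']],
    columns=['user', 'item'],
)

# One indicator column per user, one row per (user, item) record.
user_dummies = pd.get_dummies(x['user'], dtype=int)

# Collapse the records of each item. max() keeps the result binary even if
# a pair occurs more than once; sort=False keeps items in first-seen order.
item_user = user_dummies.groupby(x['item'], sort=False).max()
print(item_user)
```

With sort=False the rows appear as abc, def, ghi, fg, f76, which is the row order used in the question.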

score 0 · Answer 3 · answered Aug 14 '15 at 13:42

Here's what I could come up with:

Be careful: np.unique sorts the labels before returning them, so the row order of the output differs slightly from the one you gave in the question.

Moreover, you need to convert the array to a list of tuples before testing membership. ('a', 'abc') in [('a', 'abc'), ('b', 'def')] compares whole pairs, whereas an in test against a 2-D NumPy array compares element by element and returns True as soon as any single element matches, even if the pair as a whole is not a row.
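
A quick way to see the difference, as a minimal sketch on a two-row array (the variable name rows is mine):

```python
import numpy as np

A = np.array([['a', 'abc'], ['b', 'def']])

# ('a', 'def') is not a row of A, but `in` on an ndarray is evaluated as
# (A == value).any(), so one matching element is enough.
print(('a', 'def') in A)

# With a list of tuples, `in` compares whole pairs.
rows = [tuple(r) for r in A]
print(('a', 'def') in rows)
print(('a', 'abc') in rows)
```

The first check succeeds even though no such row exists, which is why the array is converted before the membership tests.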

import itertools as it
import numpy as np

A = np.array([
['a', 'abc'],
['b', 'def'],
['c', 'ghi'],
['d', 'abc'],
['a', 'ghi'],
['e', 'fg'],
['f', 'f76'],
['b', 'f76']])

customers = np.unique(A[:,0])
items = np.unique(A[:,1])
# list of (customer, item) tuples for whole-pair membership tests
A = [tuple(a) for a in A]
# iterate items first so that each row of the reshaped result is one item
combinations = it.product(items, customers)
C = np.array([(c, i) in A for i, c in combinations], dtype=int)
C.reshape((items.size, customers.size))
>> array(
  [[1, 0, 0, 1, 0, 0],
   [0, 1, 0, 0, 0, 0],
   [0, 1, 0, 0, 0, 1],
   [0, 0, 0, 0, 1, 0],
   [1, 0, 1, 0, 0, 0]])
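
If you want the rows and columns in order of first appearance instead of sorted order, one option is to swap np.unique for pandas.factorize, which numbers labels in the order it first sees them. This is a different technique from the one above, shown only as a sketch; the names user_codes and item_codes are mine, and it builds a scipy CSR matrix directly rather than testing every combination.

```python
import numpy as np
import pandas as pd
from scipy import sparse

A = np.array([
    ['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
    ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76'],
])

# pd.factorize returns integer codes plus the distinct labels,
# numbered in order of first appearance (no sorting).
user_codes, users = pd.factorize(A[:, 0])
item_codes, items = pd.factorize(A[:, 1])

# Rows are items, columns are users; repeated pairs would be summed.
mat = sparse.csr_matrix(
    (np.ones(len(A), dtype=int), (item_codes, user_codes)),
    shape=(len(items), len(users)),
)

print(list(items))
print(mat.toarray())
```

On the sample data the rows come out as abc, def, ghi, fg, f76, matching the layout in the question.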

score 0 · Answer 4 · edited May 23 '17 at 10:28

Here is my approach using pandas; let me know if it performs better:

import numpy as np
import pandas as pd

#create dataframe from your numpy array (x is the array from the question)
x = pd.DataFrame(x, columns=['User', 'Item'])

#get rows and cols for your sparse dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]

#initialize your sparse dataframe
#(this is not actually sparse, but you can check pandas support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols), dtype=int), columns=cols, index=rows)

#define apply function: xx holds the users of one item, xx.name is that item
def hasUser(xx):
    spdf.loc[xx.name, xx.tolist()] = 1

#groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(hasUser)

Here are the sample dataframes for the above code:

    spdf
    Out[71]: 
         a  b  c  d  e  f
    abc  1  0  0  1  0  0
    def  0  1  0  0  0  0
    ghi  1  0  1  0  0  0
    fg   0  0  0  0  1  0
    f76  0  1  0  0  0  1

    x
    Out[72]: 
      User Item
    0    a  abc
    1    b  def
    2    c  ghi
    3    d  abc
    4    a  ghi
    5    e   fg
    6    f  f76
    7    b  f76

Also, in case you want to run the groupby apply function in parallel, this question might be of help: Parallelize apply after pandas groupby
