6

I have data that looks something like this:

numpy array:

[[a, abc],
[b, def],
[c, ghi],
[d, abc],
[a, ghi],
[e, fg],
[f, f76],
[b, f76]]

It's like a user-item matrix. I want to construct a sparse matrix with shape (number_of_items, number_of_users) that holds 1 if the user has rated/bought an item and 0 if not. So, for the above example, the shape should be (5, 6). This is just an example; there are thousands of users and thousands of items.

Currently I'm doing this using two for loops. Is there any faster/pythonic way of achieving the same?

desired output:

1,0,0,1,0,0
0,1,0,0,0,0
1,0,1,0,0,0
0,0,0,0,1,0
0,1,0,0,0,1

where rows: abc,def,ghi,fg,f76 and columns: a,b,c,d,e,f

Abhishek Thakur

4 Answers

3

The easiest way is to assign integer labels to the users and items and use these as coordinates into the sparse matrix, for example:

import numpy as np
from scipy import sparse

users, I = np.unique(user_item[:,0], return_inverse=True)
items, J = np.unique(user_item[:,1], return_inverse=True)

points = np.ones(len(user_item), int)
# coo_matrix takes one (data, (row, col)) tuple; pass (J, I) instead to get items as rows
mat = sparse.coo_matrix((points, (I, J)))
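
For reference, here is a minimal sketch of the same idea applied to the sample data from the question, with items as rows and users as columns. The names user_idx, item_idx and dense are only illustrative, and np.unique sorts its labels, so the row order (abc, def, f76, fg, ghi) differs from the order in the question.

```python
import numpy as np
from scipy import sparse

# Sample (user, item) pairs from the question.
user_item = np.array([
    ['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
    ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76'],
])

# Map each label to an integer code; the unique labels come back sorted.
users, user_idx = np.unique(user_item[:, 0], return_inverse=True)
items, item_idx = np.unique(user_item[:, 1], return_inverse=True)

# One entry per (item, user) pair; rows are items, columns are users.
data = np.ones(len(user_item), dtype=int)
mat = sparse.coo_matrix(
    (data, (item_idx, user_idx)), shape=(len(items), len(users))
).tocsr()

# Converting to CSR sums duplicate pairs, so reset stored values to 1.
mat.data[:] = 1

dense = mat.toarray()
print(items)
print(dense)
```

For thousands of users and items you would keep mat sparse and skip the toarray() call; it is only there to show the result.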
2

pandas.get_dummies provides an easier way to convert a categorical column into indicator (0/1) columns:

import pandas as pd
#construct the data
x = pd.DataFrame([['a', 'abc'],['b', 'def'],['c', 'ghi'],
                 ['d', 'abc'],['a', 'ghi'],['e', 'fg'],
                 ['f', 'f76'],['b', 'f76']], 
                 columns = ['user','item'])
print(x)
#    user  item
# 0     a   abc
# 1     b   def
# 2     c   ghi
# 3     d   abc
# 4     a   ghi
# 5     e    fg
# 6     f   f76
# 7     b   f76
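for col, col_data in x.items():  # iteritems() was removed in pandas 2.0
    if str(col)=='item':
        # dtype=int keeps the 0/1 output; newer pandas defaults to booleans
        col_data = pd.get_dummies(col_data, prefix = col, dtype = int)
        x = x.join(col_data)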
print(x)
#    user  item  item_abc  item_def  item_f76  item_fg  item_ghi
# 0     a   abc         1         0         0        0         0
# 1     b   def         0         1         0        0         0
# 2     c   ghi         0         0         0        0         1
# 3     d   abc         1         0         0        0         0
# 4     a   ghi         0         0         0        0         1
# 5     e    fg         0         0         0        1         0
# 6     f   f76         0         0         1        0         0
# 7     b   f76         0         0         1        0         0
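
The frame above has one row per purchase rather than one row per item. If you want the items-by-users layout from the question, one option (a different technique from get_dummies, shown here only as a sketch) is pandas.crosstab, which counts each (item, user) pair; clipping the counts at 1 gives the 0/1 indicator. The names counts and indicator are just for illustration.

```python
import pandas as pd

# Sample (user, item) pairs from the question.
x = pd.DataFrame(
    [['a', 'abc'], ['b', 'def'], ['c', 'ghi'], ['d', 'abc'],
     ['a', 'ghi'], ['e', 'fg'], ['f', 'f76'], ['b', 'f76']],
    columns=['user', 'item'],
)

# Rows are items, columns are users; each cell counts occurrences of the pair.
counts = pd.crosstab(x['item'], x['user'])

# Cap counts at 1 so repeated purchases still show up as a single 1.
indicator = counts.clip(upper=1)
print(indicator)
```

Note that crosstab sorts both axes, and the result is a dense DataFrame, so this suits moderate sizes better than very large user/item sets.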
Yung
0

Here's what I could come up with:

You need to be careful since np.unique will sort the items before returning them, so the output format is slightly different to the one you gave in the question.

Moreover, you need to convert the array to a list of tuples: ('a', 'abc') in [('a', 'abc'), ('b', 'def')] compares whole pairs and returns True only for an exact match, whereas testing a pair against the 2-D NumPy array itself returns True as soon as any single element matches.
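
A small sketch of that difference, using a two-row array and a pair that is not actually in the data (the names pair, in_array and in_tuples are only for illustration):

```python
import numpy as np

A = np.array([['a', 'abc'], ['b', 'def']])

# ('a', 'def') is not a row of A, but 'a' and 'def' each appear somewhere.
pair = ('a', 'def')

# ndarray membership is elementwise: it behaves like (A == pair).any().
in_array = pair in A

# Tuple membership compares the whole pair against each row.
in_tuples = pair in [tuple(row) for row in A]

print(in_array, in_tuples)  # True False
```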

import itertools as it
import numpy as np

A = np.array([
['a', 'abc'],
['b', 'def'],
['c', 'ghi'],
['d', 'abc'],
['a', 'ghi'],
['e', 'fg'],
['f', 'f76'],
['b', 'f76']])

customers = np.unique(A[:,0])
items = np.unique(A[:,1])
A = [tuple(a) for a in A]
# iterate items in the outer loop so each row of the result is one item
combinations = it.product(items, customers)
C = np.array([(c, i) in A for i, c in combinations], dtype=int)
C.reshape((items.size, customers.size))
>> array(
  [[1, 0, 0, 1, 0, 0],
   [0, 1, 0, 0, 0, 0],
   [0, 1, 0, 0, 0, 1],
   [0, 0, 0, 0, 1, 0],
   [1, 0, 1, 0, 0, 0]])
Daniel Lenz
0

Here is my approach using pandas; let me know if it performs better:

import numpy as np
import pandas as pd

#create dataframe from your numpy array (x is the array from the question)
x = pd.DataFrame(x, columns=['User', 'Item'])

#get rows and cols for your sparse dataframe
cols = pd.unique(x['User']); ncols = cols.shape[0]
rows = pd.unique(x['Item']); nrows = rows.shape[0]

#initialize your sparse dataframe
#(this is not sparse, but you can check pandas support for sparse datatypes)
spdf = pd.DataFrame(np.zeros((nrows, ncols), dtype=int), columns=cols, index=rows)

#define apply function: xx holds the users of one item, xx.name is that item
def hasUser(xx):
    spdf.loc[xx.name, xx.values] = 1  # .ix was removed from pandas; use .loc

#groupby and apply to create desired output dataframe
g = x.groupby(by='Item', sort=False)
g['User'].apply(hasUser)

Here are the sample dataframes for the above code:

    spdf
    Out[71]: 
         a  b  c  d  e  f
    abc  1  0  0  1  0  0
    def  0  1  0  0  0  0
    ghi  1  0  1  0  0  0
    fg   0  0  0  0  1  0
    f76  0  1  0  0  0  1

    x
    Out[72]: 
      User Item
    0    a  abc
    1    b  def
    2    c  ghi
    3    d  abc
    4    a  ghi
    5    e   fg
    6    f  f76
    7    b  f76
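
Since spdf is a dense frame, here is a rough sketch of one way to turn it into an actual SciPy sparse matrix afterwards. The frame is rebuilt by hand from the output shown above so the snippet runs on its own, and the name mat is only illustrative. This still allocates the dense array first, so it does not save memory while building.

```python
import pandas as pd
from scipy import sparse

# The items-by-users frame from the sample output, rebuilt by hand.
spdf = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 1]],
    index=['abc', 'def', 'ghi', 'fg', 'f76'],
    columns=list('abcdef'),
)

# csr_matrix keeps only the non-zero cells of the dense array.
mat = sparse.csr_matrix(spdf.to_numpy())
print(mat.shape, mat.nnz)
```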

Also, in case you want to parallelize the groupby/apply step, this question might help: Parallelize apply after pandas groupby

Community
bitspersecond