14

I have a dictionary whose keys are user_ids and whose values are lists of the movie_ids liked by that user, with #unique_users = 573000 and #unique_movies = 16000.

{1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131,567,897,923],..........}

Now I want to convert this into a matrix with user_ids as rows and movie_ids as columns, with value 1 for the movies a user has liked, i.e. it will be 573000*16000.

Ultimately I have to multiply this matrix with its transpose to get a co-occurrence matrix with dim (#unique_movies, #unique_movies).

Also, what will be the time complexity of the X'*X operation where X has a shape like (500000, 12000)?

chirag yadav

3 Answers

10

I think you can construct an empty dok_matrix and fill the values. Then transpose it and convert it to csr_matrix for efficient matrix multiplications.

import numpy as np
import scipy.sparse as sp
d = {1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131,567,897,923]}

mat = sp.dok_matrix((573000,16000), dtype=np.int8)

for user_id, movie_ids in d.items():
    mat[user_id, movie_ids] = 1

mat = mat.transpose().tocsr()
print(mat.shape)  # (16000, 573000) after the transpose
Zichen Wang
  • but then the for loop will have 57300 iterations, as that is the number of distinct users in the dictionary – chirag yadav Jun 16 '16 at 14:45
  • @chiragyadav I think that should be efficient because you've already indexed your data in the dictionary, and dok_matrix is efficient for constructing a matrix incrementally. – Zichen Wang Jun 16 '16 at 14:54
  • `import scipy.sparse as sp mat = sp.dok_matrix((576808,11287), dtype=np.int8) for uid,brand_list in user_pref_dict.items(): mat[uid, brand_list] = 1` Tried the above code but it's throwing the below error: index (131) out of range -11287 to 11286) – chirag yadav Jun 16 '16 at 14:56
  • I was thinking of creating an array for each user, initializing it to 0 and populating it with ones using the indexes from the dict, then appending the array to the matrix as a new row. – Ma0 Jun 16 '16 at 14:57
  • @Ev. Kounis isn't that going to be very inefficient and time-consuming, considering the number of users? – chirag yadav Jun 16 '16 at 15:00
  • @chiragyadav you may have numbers in `brand_list` larger than 11286? – Zichen Wang Jun 16 '16 at 15:03
  • @Zichen Wang thanks for pointing it out. The movie_ids also need not lie in the interval (0, #unique_ids). – chirag yadav Jun 16 '16 at 15:06
  • @chiragyadav You may want to find out the max indexes for rows and columns then construct the sparse matrix. – Zichen Wang Jun 16 '16 at 15:09
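One way to handle the non-contiguous IDs raised in these comments is to map each user_id and movie_id to a 0-based position before building the matrix, so the shape is exactly (#unique_users, #unique_movies) rather than (max_id + 1, max_id + 1). The following is only a minimal sketch on a shortened version of the example dictionary; the names `user_pos` and `movie_pos` are illustrative and do not appear in the original posts.

```python
import numpy as np
import scipy.sparse as sp

# Shortened sample data; the IDs are deliberately not contiguous.
d = {1: [51, 379, 552], 2: [51, 379, 1674], 5: [783, 909], 9: [3193]}

# Map arbitrary IDs to contiguous 0-based row/column positions.
user_ids = sorted(d)
movie_ids = sorted({m for movies in d.values() for m in movies})
user_pos = {u: i for i, u in enumerate(user_ids)}
movie_pos = {m: j for j, m in enumerate(movie_ids)}

# One (row, col) pair per liked movie.
rows = [user_pos[u] for u, movies in d.items() for _ in movies]
cols = [movie_pos[m] for movies in d.values() for m in movies]
data = np.ones(len(rows), dtype=np.int32)

X = sp.csr_matrix((data, (rows, cols)),
                  shape=(len(user_ids), len(movie_ids)))
print(X.shape)
```

Keeping `user_ids` and `movie_ids` around lets you translate matrix positions back to the original IDs later. Note that if a user's list contained the same movie twice, this constructor would sum the duplicate entries instead of storing a single 1.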
1
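import pandas as pd

df = {1: [51, 379, 552, 2333, 2335, 4089, 4484], 2: [51, 379, 552, 1674, 1688, 2333, 3650, 4089, 4296, 4484], 5: [783, 909, 1052, 1138, 1147, 2676], 7: [171, 321, 959], 9: [3193], 10: [959], 11: [131,567,897,923]}  # remaining users elided in the original
# One row per user; shorter lists are padded with NaN
df2 = pd.DataFrame.from_dict(df, orient='index')
# Long format: level_0 = user ID, level_1 = list position, 0 = movie ID (dropna guards against padding rows)
df2 = df2.stack().dropna().reset_index()
df2['level_1'] = 1
df2.pivot(index='level_0', columns=0, values='level_1').fillna(0)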

This converts the dict into a DataFrame, then stacks it to get the user IDs and movie IDs in separate columns; all values of the unused column level_1 are then set to 1. The last statement creates a pivot table, filling non-existent combinations with zeros.

user3404344
0

You can create the csr_matrix in one step, using the format csr_matrix((data, (row_ind, col_ind))). Here is a snippet showing how to do that.

import scipy.sparse as sp
d = {0: [0,1], 1: [1,2,3], 
     2: [3,4,5], 3: [4,5,6], 
     4: [5,6,7], 5: [7], 
     6: [7,8,9]}
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = [i for ids in d.values() for i in ids]
X = sp.csr_matrix(([1]*len(row_ind), (row_ind, col_ind))) # sparse csr matrix

You can use the matrix X to compute the co-occurrence matrix later (i.e. X.T * X) (credit: GitHub @daniel-acuna). I guess there is a faster way to convert the dictionary of lists to row_ind, col_ind.
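As a rough sketch of the co-occurrence step and of the cost question, the snippet below builds the index arrays with NumPy (one possible alternative to the list comprehensions, not necessarily faster on every input) and then multiplies. It assumes a wider dtype than int8, since co-occurrence counts above 127 would overflow. For a dense X of shape (n_users, n_movies) the product costs on the order of n_users * n_movies^2 operations, while for a sparse product the number of scalar multiplications is roughly the sum over users of (number of liked movies)^2; the `mults` variable is an illustrative name for that count.

```python
import numpy as np
import scipy.sparse as sp

d = {0: [0, 1], 1: [1, 2, 3], 2: [3, 4, 5], 3: [4, 5, 6],
     4: [5, 6, 7], 5: [7], 6: [7, 8, 9]}

# Build the COO-style index arrays without nested Python loops.
lengths = [len(v) for v in d.values()]
row_ind = np.repeat(list(d.keys()), lengths)
col_ind = np.concatenate([np.asarray(v) for v in d.values()])

# int32 rather than int8 so that large co-occurrence counts cannot overflow.
data = np.ones(len(row_ind), dtype=np.int32)
X = sp.csr_matrix((data, (row_ind, col_ind)))

# C[i, j] = number of users who liked both movie i and movie j;
# the diagonal C[i, i] is the number of users who liked movie i.
C = (X.T @ X).tocsr()

# Approximate multiplication count for the sparse product:
# each user contributes (likes per user)^2 terms.
mults = int(sum(n * n for n in lengths))
print(C.shape, mults)
```

The result stays sparse, so for data where each user likes only a handful of movies the cost depends on the number of likes rather than on the full 573000 x 16000 shape.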

titipata