
I have a table of 23094592 (~2.3*10^7) rows giving the ratings that 11701890 unique users assigned to 1000000 unique items. I am trying to build the users-vs-items ratings matrix (11701890 x 1000000) for collaborative filtering.

Here is the pseudo-code of what I have implemented:

from scipy.sparse import csr_matrix
import cPickle

totalRows = 23094592
uniqueUsers = 11701890
uniqueItems = 1000000

# userIndex[i], itemIndex[i] and ratings[i] come from the i-th row of the ratings table
M = csr_matrix((uniqueUsers, uniqueItems))
for i in range(totalRows):
    M[userIndex[i], itemIndex[i]] = ratings[i]

cPickle.dump(M, open('ratings.pkl', 'w+b'))

However, I have been running this code on a Google Cloud VM with 52 GB of RAM, and it has now taken about 2 whole days to complete only about 20% of the loop.

Also, although M.data.nbytes for the sparse matrix showed around 100 MB at some point in time, the actual space used by the ratings.pkl file (as reported by du -sh) was much more - about 3 GB!

This also led to disk space issues, and I had to increase the disk size of the VM in Google Cloud.

Could someone please suggest a faster way to run this huge loop and a more memory-efficient way to store the sparse matrix?

aveek
  • How many cores does the CPU you're using have? Given x cores, you can create y threads that work on different portions of the matrix; say you have 4 cores and 8 threads, you can assign each thread to 1/8 of the matrix, which will speed up the whole process. – Cristofor Apr 22 '17 at 10:33
  • Please provide a small (5-7 rows) reproducible sample (fake) data set (in text/CSV format) and the desired data set. I guess it might be possible to get rid of the loops (using vectorized operations) - this might speed up the whole process dramatically. – MaxU - stand with Ukraine Apr 22 '17 at 10:35
  • `pickle` really isn't going to be much good here, even JSON would be faster I/O but still not suitable. It might be all but impossible to load your final file back in in any meaningful way at the end. At a guess, maybe HDF5 format, but I couldn't say for sure. – roganjosh Apr 22 '17 at 10:46
  • @roganjosh, [here is a very good question + answers](http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format) describing and comparing different methods of saving & loading sparse matrices (a small sketch along those lines follows after these comments) – MaxU - stand with Ukraine Apr 22 '17 at 10:55
  • @MaxU very interesting thanks, it's going to take me some time to get my head around this properly tbh (`pickle` failing on very large matrices does not surprise me). I have come across [this](http://www.benfrederickson.com/dont-pickle-your-data/) before, but perhaps it's not in the context of using `HIGHEST_PROTOCOL`. If I get time I will compare with HDF5, which I've used to good effect in the past... something doesn't feel right about `cPickle` being in this arena :) EDIT: Actually, `HIGHEST_PROTOCOL` is in the graph and slower than JSON. – roganjosh Apr 22 '17 at 11:12
  • Haven't you gotten a SparseEfficiencyWarning? Creating a sparse matrix iteratively is bad practice, especially with the csr format. The `lil` format is a little better. – hpaulj Apr 22 '17 at 11:49
  • @MaxU 100,12,4 is one such row in the ratings.csv, where 100 is the userId, 12 is the itemId and 4 is the rating given by user 100 for item 12 on a scale of 10. I did try numpy's vectorize, but it did not give any significant boost in speed compared to the for loop. – aveek Apr 23 '17 at 07:35
  • @hpaulj Yes, I did get that. But initializing the lil matrix was taking more time than csr_matrix, so I decided to stick with csr_matrix. – aveek Apr 23 '17 at 07:39
  • @Cristofor I had tried something like this with multiprocessing, but was quite disappointed that it didn't give any boost in speed either. – aveek Apr 23 '17 at 07:40
  • `lil_matrix((uniqueUsers, uniqueItems))` should be fast since it doesn't have to do anything; just make an empty matrix. But regardless, creating the matrix once from `coo` style inputs, as shown in the accepted answer, is better than iterative assignment. – hpaulj Apr 23 '17 at 07:46
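
Following up on the save/load discussion above, here is a minimal sketch of storing the finished matrix without pickle. It assumes a SciPy version that ships scipy.sparse.save_npz/load_npz; for older versions the CSR attributes can be stored with numpy.savez instead, as in the answer linked above.

import numpy as np
import scipy.sparse

# Compressed .npz storage of the finished CSR matrix (no pickling involved)
scipy.sparse.save_npz('ratings.npz', M, compressed=True)
M = scipy.sparse.load_npz('ratings.npz')

# Fallback for older SciPy versions: store the raw CSR attributes
np.savez('ratings_csr.npz', data=M.data, indices=M.indices,
         indptr=M.indptr, shape=M.shape)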

1 Answer


As described in the scipy.sparse.csr_matrix docs, you can create a sparse matrix in the following way:

csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])

where data, row_ind and col_ind satisfy the relationship

a[row_ind[k], col_ind[k]] = data[k].

I tried this with some randomly generated data of the same size, and the matrix creation took about 18 seconds.

from scipy.sparse import csr_matrix
import numpy as np

totalRows = 23094592
UniqueUsers = 11701890
UniqueItems = 1000000

# Randomly generated stand-ins for the real user indices, item indices and ratings
users = np.random.randint(UniqueUsers, size=totalRows)
items = np.random.randint(UniqueItems, size=totalRows)
ratings = np.random.randint(5, size=totalRows)

# Build the whole matrix in one call from COO-style (data, (row, col)) input
M = csr_matrix((ratings, (users, items)), shape=(UniqueUsers, UniqueItems))

This will create a sparse matrix where M[users[k], items[k]] = ratings[k].
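
Note that with this COO-style input, duplicate (user, item) pairs are summed during construction rather than overwritten as they would be by the iterative assignment in the original loop. A tiny illustration of that behaviour:

from scipy.sparse import csr_matrix

# Two values supplied for the same position (0, 0) get summed, not overwritten
demo = csr_matrix(([1, 2], ([0, 0], [0, 0])), shape=(2, 2))
print(demo[0, 0])   # prints 3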

You should make sure that users and items are 0-based indices of every unique user and item, perhaps by using LabelEncoder from scikit-learn.
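
A minimal sketch of that encoding step, assuming the raw data lives in a header-less ratings.csv with userId,itemId,rating rows as described in the comments (the file name and column names are assumptions):

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder

# Read the raw ratings table; columns are assumed to be userId, itemId, rating
df = pd.read_csv('ratings.csv', names=['userId', 'itemId', 'rating'])

# Map arbitrary user/item ids to contiguous 0-based indices
user_enc = LabelEncoder()
item_enc = LabelEncoder()
users = user_enc.fit_transform(df['userId'].values)
items = item_enc.fit_transform(df['itemId'].values)

# Build the ratings matrix directly from the encoded indices
M = csr_matrix((df['rating'].values, (users, items)),
               shape=(len(user_enc.classes_), len(item_enc.classes_)))

The original ids can be recovered later with inverse_transform on the encoders if needed.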

gereleth