I have a table of 23094592 (about 2.3*10^7) rows that gives the ratings given by 11701890 unique users for 1000000 unique items. I am trying to build the ratings matrix (11701890 x 1000000) of users vs. items for collaborative filtering.
Here is the pseudo-code of what I have implemented:
from scipy.sparse import csr_matrix
import cPickle

totalRows = 23094592
uniqueUsers = 11701890
uniqueItems = 1000000

# empty users x items sparse matrix
M = csr_matrix((uniqueUsers, uniqueItems))

# fill in the ratings one entry at a time
for i in range(totalRows):
    M[userIndex[i], itemIndex[i]] = ratings[i]

# serialize the matrix to disk
with open('ratings.pkl', 'wb') as f:
    cPickle.dump(M, f)
However, I have been running this code on a Google Cloud VM with 52 GB of RAM, and after about 2 whole days it has completed only about 20% of the loop.
Also, although M.data.nbytes for the sparse matrix showed around 100 MB at some point, the actual space used by the ratings.pkl file (as reported by du -sh) was much more, about 3 GB!
This also led to memory issues, and I had to increase the disk size of the VM in Google Cloud.
Could someone please suggest a faster way to run this huge loop and a more memory-efficient way to store the sparse matrix?
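For reference, this is roughly the alternative I was considering but have not verified: a minimal sketch that builds the whole matrix in one vectorized call and saves it with scipy.sparse.save_npz instead of pickle. It assumes the user indices, item indices and ratings can be loaded as three NumPy arrays of length totalRows (the array names below are just placeholders) and that my SciPy version provides save_npz:

import numpy as np
from scipy.sparse import coo_matrix, save_npz

# placeholder arrays: one entry per row of the ratings table
userIndex = np.asarray(userIndex, dtype=np.int32)
itemIndex = np.asarray(itemIndex, dtype=np.int32)
ratings = np.asarray(ratings, dtype=np.float32)

# build the whole users x items matrix in one call instead of a Python loop
M = coo_matrix((ratings, (userIndex, itemIndex)),
               shape=(uniqueUsers, uniqueItems)).tocsr()

# store only the non-zero data, indices and indptr arrays
save_npz('ratings.npz', M)

My thinking is that the vectorized constructor avoids assigning into the CSR structure one element at a time, and save_npz should only store the non-zero arrays, but I am not sure whether this is the right direction.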