I need to store word co-occurrence counts in several 14000x10000 matrices. Since I know the matrices will be sparse and I do not have enough RAM to store all of them as dense matrices, I am storing them as scipy.sparse matrices.
I have found the most efficient way to gather the counts to be using Counter objects. Now I need to transfer the counts from the Counter objects to the sparse matrices, but this takes too long. It currently takes on the order of 18 hours to populate the matrices.
The code I'm using is roughly as follows:
for word_ind1 in range(len(wordlist1)):
for word_ind2 in range(len(wordlist2)):
word_counts[word_ind2, word_ind1]=word_counters[wordlist1[word_ind1]][wordlist2[word_ind2]]
Where word_counts
is a scipy.sparse.lil_matrix object, word_counters
is a dictionary of counters, and wordlist1
and wordlist2
are lists of strings.
Is there any way to do this more efficiently?