2

I need to store word co-occurrence counts in several 14000x10000 matrices. Since I know the matrices will be sparse and I do not have enough RAM to store all of them as dense matrices, I am storing them as scipy.sparse matrices.

I have found the most efficient way to gather the counts to be using Counter objects. Now I need to transfer the counts from the Counter objects to the sparse matrices, but this takes too long. It currently takes on the order of 18 hours to populate the matrices.

The code I'm using is roughly as follows:

for word_ind1 in range(len(wordlist1)):
    for word_ind2 in range(len(wordlist2)):
        word_counts[word_ind2, word_ind1]=word_counters[wordlist1[word_ind1]][wordlist2[word_ind2]]

Where word_counts is a scipy.sparse.lil_matrix object, word_counters is a dictionary of counters, and wordlist1 and wordlist2 are lists of strings.

Is there any way to do this more efficiently?

  • What is `word_counts`? – Fred Foo Apr 01 '14 at 20:42
  • It's the sparse matrix I'm trying to populate. I just edited the question to say so. – user3486648 Apr 01 '14 at 20:48
  • Obviously, but what type of sparse matrix? – Fred Foo Apr 01 '14 at 20:49
  • Edited again to say it's a lil_matrix. It might be worthwhile trying other types of sparse matrices, but the SciPy documentation says this type is recommended for incrementally populating a matrix, and if the sparsity structure changes during population. – user3486648 Apr 01 '14 at 20:54
  • Do they still? I had a patch accepted that changed the recommendation for LIL matrices because actually constructing them can be extremely expensive unless items are pre-sorted. – Fred Foo Apr 01 '14 at 21:00
  • The documentation recommends both [LIL](http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.sparse.lil_matrix.html) and [DOK](http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.sparse.lil_matrix.html#scipy.sparse.lil_matrix) as "efficient structure[s] for constructing sparse matrices incrementally". Let me try DOK instead. I'll get back this thread once I do. Thanks for the help. – user3486648 Apr 01 '14 at 21:04
  • I'm writing a patch for SciPy right now. I've already suggested deprecating LIL, and the maintainers were not too negative about that idea. – Fred Foo Apr 01 '14 at 21:19

1 Answers1

2

You're using LIL matrices, which (unfortunately) have a linear-time insertion algorithm. Therefore, constructing them in this way takes quadratic time. Try a DOK matrix instead, those use hash tables for storage.

However, if you're interested in boolean term occurrences, then computing the co-occurrence matrix is much faster if you have a sparse term-document matrix. Let A be such a matrix of shape (n_documents, n_terms), then the co-occurrence matrix is

A.T * A
Fred Foo
  • 355,277
  • 75
  • 744
  • 836