
I am using scikit-learn to do some text processing, such as tf-idf. The number of filenames (~40k) is handled fine, but the number of unique words is not: I can't work with the resulting array/matrix, whether it's printing the number of unique words or dumping the numpy array to a file (using savetxt). The traceback is below. Ideally, I would get just the top tf-idf values, since I don't need them for every single word in every single document. Alternatively, I could exclude certain words from the calculation (not stop words, but a separate set of words in a text file that I could supply), though I don't know whether the words I would take out would alleviate this situation. Finally, if I could somehow grab pieces of the matrix, that could work too. Any example of dealing with this kind of thing would be helpful and give me some starting points. (PS: I looked at and tried HashingVectorizer, but it doesn't seem like I can do tf-idf with it?)

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

path = "/home/files/"

fh = open('output.txt','w')


filenames = os.listdir(path)

filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)
fh.write(str(vectorizer.vocabulary_))

array = X.toarray()
print array.size
print array.shape

Edit: In case this helps,

print 'Array is: ' + str(X.get_shape()[0]) + ' by ' + str(X.get_shape()[1]) + ' matrix.'

This prints the dimensions of the too-large sparse matrix; in my case:

Array is: 39436 by 113214 matrix.
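
That shape is the whole problem: a dense array of 8-byte values at 39436 × 113214 is about 4.46 billion entries, roughly 33 GiB. On a 32-bit Python the element count alone exceeds what numpy can index, which is one common way to get "ValueError: array is too big." (on a 64-bit build you would more likely see a MemoryError). Back-of-the-envelope arithmetic, not from the original code:

rows, cols = 39436, 113214
n_elements = rows * cols           # 4464707304 entries (~4.46 billion)
dense_bytes = n_elements * 8       # 8 bytes each for int64/float64
print(n_elements)                  # 4464707304
print(dense_bytes / float(2**30))  # ~33.3 GiB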

1 Answer


The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation into a dense one. Instead of storing one value per word that actually appears in a given document, you're now allocating a value for every word in every document, the vast majority of which are zeros.

Thankfully, most operations work with sparse matrices, or have sparse variants. Just avoid calling .toarray() or .todense() and you'll be good to go.
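
For example, a minimal sketch (assuming X is the sparse matrix returned by fit_transform above) of what you can inspect without ever densifying:

print(X.shape)                # (n_documents, n_unique_words)
print(X.nnz)                  # number of stored (nonzero) entries
col_sums = X.sum(axis=0)      # per-word totals, a 1 x n_words matrix
row = X[0]                    # one document's row, still sparse
small = X[:5, :10].toarray()  # a tiny slice is safe to densify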

For more information, check out the scipy sparse matrix documentation.
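
On the PS in the question: HashingVectorizer doesn't keep a vocabulary or IDF statistics itself, but its output can be piped through TfidfTransformer to get tf-idf. And since CountVectorizer's stop_words parameter also accepts a plain list, excluding a separate word set works directly. A sketch under those assumptions (excluded_words.txt is a hypothetical file with one word per line; non_negative=True is the flag in scikit-learn versions of this era, later replaced by alternate_sign=False):

from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfTransformer)

# Hash term counts, then apply IDF weighting to get tf-idf
hasher = HashingVectorizer(input='filename', stop_words='english',
                           non_negative=True)
counts = hasher.transform(filenames)  # sparse, fixed number of columns
tfidf = TfidfTransformer().fit_transform(counts)

# Excluding a custom word list: stop_words accepts a list, too
excluded = [line.strip() for line in open('excluded_words.txt') if line.strip()]
vectorizer = CountVectorizer(input='filename', stop_words=excluded)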

  • Thank you for the reply. All of my code works when I do not use `.toarray()` or `.todense()`. – KBA Nov 12 '13 at 20:44
  • What are other operations I can use to access the results, or part of the results? I see in the link the examples of performing operations on the matrix. Would you suggest a slicing method? – KBA Nov 12 '13 at 21:03
  • It depends on what you'd like to use the matrix for. If you just want to poke around, I'd suggest trying a smaller dataset and looking at it with `toarray()` first. Once you know what you want to do, you can find the sparse function to operate on the big data. – perimosocordiae Nov 12 '13 at 22:17
  • One thing I would like to do: given an index representing a column of the array, how could I extract the values at that index? For example, for index 500, I want the values in that column for all rows, without .toarray(). I have been looking at [link](http://stackoverflow.com/questions/14477448/efficient-slicing-of-matrices-using-matrix-multiplication-with-python-numpy-s/16507031#16507031) for other solutions that might work, if you have any suggestions (a slicing sketch follows these comments). – KBA Nov 13 '13 at 05:48
  • Next, for the tf-idf matrix, I wanted to get the top tf-idf values. I saw how one could set a max_features amount for the tf-idf vectorizer, but that keeps the words with the top tf counts; I still want the highest tf-idf values, which could include words with low tf. One idea I looked up is doing something like tf_idf_matrix.sum(axis=0), which sums up the columns. This works in my code, but with 113k columns, print won't show them all. If I could use something like argsort to access the top K column sums, that would be helpful (a sketch follows these comments). – KBA Nov 13 '13 at 05:55
  • You can slice your sparse matrix like a regular array, depending on its type: http://stackoverflow.com/a/13352545/10601 And once you have your slice, you could use `.toarray()` to look at it, because it will likely be small enough to fit in memory. – perimosocordiae Nov 13 '13 at 20:42
  • For your second question, I'd suggest making another question on SO. – perimosocordiae Nov 13 '13 at 20:44
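
Two sketches for the follow-ups above, assuming X is the CSR matrix from fit_transform and tf_idf_matrix is a tf-idf weighted version of it (top_k and index_to_word are illustrative names, not from the original code). Pulling one column for all rows stays cheap if you slice before densifying:

col = X.tocsc()[:, 500]            # CSC makes column slicing efficient; still sparse
col_dense = col.toarray().ravel()  # ~40k values, easily fits in memory

And the sum-then-argsort idea for the top-K tf-idf columns, without printing all 113k sums:

import numpy as np

col_sums = np.asarray(tf_idf_matrix.sum(axis=0)).ravel()  # one sum per column
top_k = 20
top_indices = np.argsort(col_sums)[::-1][:top_k]          # indices of the K largest sums

# Map column indices back to words by inverting the vocabulary
index_to_word = dict((i, w) for w, i in vectorizer.vocabulary_.items())
for i in top_indices:
    print('%s %.4f' % (index_to_word[i], col_sums[i]))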