I have created pre-processed data. Now I would like to vectorize it and write it to a text file. While transforming the vectorizer object to an array, I get the error below. What are possible solutions?

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=1000)
    newTestFile = open("testfile.txt", 'r', encoding='latin-1')
    featureVector = vectorizer.fit_transform(newTestFile)
    train_data_features = featureVector.toarray()
    np.savetxt('plotFeatureVector.txt', train_data_features, fmt="%10s %10.3f")

The error:

    Traceback (most recent call last):
      File "C:/Users/NuMA/Desktop/Lecture Stuff/EE 485/Project/Deneme/bagOfWords.py", line 12, in <module>
        train_data_features = featureVector.toarray()
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\compressed.py", line 964, in toarray
        return self.tocoo(copy=False).toarray(order=order, out=out)
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\coo.py", line 252, in toarray
        B = self._process_toarray_args(order, out)
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
        return np.zeros(self.shape, dtype=self.dtype, order=order)
    ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
  • You are not transforming a vectorizer object, `featureVector` is a sparse matrix. – juanpa.arrivillaga Mar 22 '17 at 21:17
  • Possible duplicate of [Save / load scipy sparse csr_matrix in portable data format](http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format) – juanpa.arrivillaga Mar 22 '17 at 21:17
  • In particular, you should use the **np.savez / np.load** approach in [this](http://stackoverflow.com/a/42101691/5014455) answer of the dupe-target. – juanpa.arrivillaga Mar 22 '17 at 21:21
  • The latest `scipy.sparse` (0.19?) has a `save_npz`/`load_npz` pair of functions like that `savez` approach; see the sketch below. – hpaulj Mar 22 '17 at 21:38
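
A minimal sketch of the sparse save/load approach the comments suggest, assuming SciPy >= 0.19 and the `featureVector` from the question:

    from scipy import sparse

    # Save the sparse matrix as-is -- no dense conversion required
    sparse.save_npz('plotFeatureVector.npz', featureVector)

    # Later: reload it in the same compressed sparse format
    featureVector = sparse.load_npz('plotFeatureVector.npz')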

1 Answer

`vectorizer` has created a large sparse matrix, `featureVector`.

`featureVector.toarray()` (I usually use `featureVector.A`) is supposed to create a dense (regular NumPy) array from that. Evidently the required size is too large.

Could you print `repr(featureVector)`? That should show the shape, dtype and number of nonzero terms of this matrix. I'm guessing it has millions of rows and thousands of columns.
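
For illustration, a minimal check (the output format is how SciPy prints a sparse matrix; the numbers shown match what the asker later reports in the comments):

    print(repr(featureVector))
    # <290988x1000 sparse matrix of type '<class 'numpy.int64'>'
    #         with 12110452 stored elements in Compressed Sparse Row format>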

So even if it did work, I doubt that `savetxt` with `fmt="%10s %10.3f"` would work: that two-field format string won't match rows with thousands of columns. Nor would a `csv` file of such a large array be usable.

So make sure you understand what `vectorizer` is producing, and rethink the idea of creating a dense array from the result and saving it.
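
To see why the dense conversion fails here, a rough size estimate helps (shape and dtype taken from the asker's comment below; the `Python35-32` paths in the traceback indicate a 32-bit interpreter):

    rows, cols, itemsize = 290988, 1000, 8   # int64 entries are 8 bytes
    print(rows * cols * itemsize / 1e9)      # ~2.33 GB
    # A 32-bit Python process can typically address only ~2 GB,
    # so the np.zeros() call inside toarray() cannot succeed.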

hpaulj
  • nnz = 12110452, shape = (290988, 1000), dtype = int64, format = csr. I know it is a large matrix, but it should not be so hard to write it to an array. –  Mar 23 '17 at 12:35
  • Have you tried making an `np.zeros(...)` that big directly? – hpaulj Mar 23 '17 at 13:26
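
A minimal way to run that check, using the shape and dtype reported in the comment above:

    import numpy as np

    try:
        # The same allocation toarray() attempts internally
        big = np.zeros((290988, 1000), dtype=np.int64)
    except (ValueError, MemoryError) as err:
        print(err)   # on 32-bit Python: "array is too big; ..."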