
I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, where each entry has 150k features. I've tried to implement the code from the following link: Implementing Bag-of-Words Naive-Bayes classifier in NLTK

The problem (as I understand it) is that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict as labels):

Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found

My question is: is there a way to avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently here instead of dok_matrix? Or am I missing something obvious?

Thanks for anyone's time. :)

EDIT, Sep 6:

Found the iterkeys, so at least the code runs. It's still too slow: it has taken several hours on a dataset of size 32k and still hasn't finished. Here's what I've got at the moment:

import numpy as np
from collections import OrderedDict
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()

#collect the data into the matrix

pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# pair each training row (as a dok_matrix) with its label
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets - foldsize)]

classif.train(add_label(matr[:(lentweets - foldsize), 0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets - foldsize)]
data = np.array(classif.batch_classify(readrow))

The problem might be that each row that is extracted doesn't exploit the sparseness of the vector, but instead iterates over all 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
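
To illustrate what I mean by exploiting sparseness, a sketch (untested): converting the matrix to CSR once and slicing rows from that should only touch the stored nonzeros, instead of calling getrow on the dok_matrix for every row.

csr = matr.tocsr()   # one-time conversion; CSR stores each row contiguously
row = csr[5]         # slicing a CSR row touches only its stored nonzeros
print row.nnz        # number of stored (nonzero) entries in that row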

  • Perhaps you can encode your features more efficiently, or reduce their size? – piokuc Aug 31 '12 at 21:01
  • True, but whatever the number of features, I'm afraid I'll still need to manage the size of the matrix. The dataset consists of the words of tweets. – user1638859 Aug 31 '12 at 21:09
  • Found the iterkeys at least; now the problem is that the code is too slow. – user1638859 Sep 06 '12 at 07:31
  • Do you need to do it in Python? Have a look at MALLET: http://mallet.cs.umass.edu/, it's pretty fast. – piokuc Sep 06 '12 at 08:31
  • No need for Python per se, but we have people here familiar with it. Thanks, I'll check that out. Still, I suppose it would be nice to get a definitive solution for large datasets, so that anyone googling this problem will have an answer here. – user1638859 Sep 07 '12 at 11:29
  • Sure. BTW, there is an NLTK interface to MALLET; google for it. I've never used it. MALLET is easy to use as a command-line tool: you prepare the input (text) in files, use a command-line tool to import the data into an internal MALLET format, then run MALLET itself with suitable options and get the results in a text format. But I guess the Python interface is also useful. – piokuc Sep 07 '12 at 13:33
  • It looks like you're dealing with "tweet-length" documents here, have you seen [libshorttext](http://www.csie.ntu.edu.tw/~cjlin/libshorttext/) yet? I just started using it to do classification on a corpus of ~10million tweet sized documents, and it's super fast and accurate (I'm getting 80-90% accuracy with 6 categories and a training set of about 400 documents). And it's written in Python/C as a bonus! EDIT: and, I just realized this thread is almost a year old – sbrother May 15 '13 at 02:30
  • This link might be helpful: [Text Classification and Feature Hashing](http://blog.newsle.com/2013/02/01/text-classification-and-feature-hashing-sparse-matrix-vector-multiplication-in-cython/) – ely Oct 09 '13 at 17:31

1 Answer


Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets.(*)

If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:

from sklearn.feature_extraction.text import TfidfVectorizer

# input='filename' makes the vectorizer read each document from disk
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)

You now have a training set X in the CSR sparse matrix format, which you can feed to a Naive Bayes classifier if you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
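
For instance, if the class were encoded in each file's parent directory (a hypothetical layout; adapt this to your own naming scheme), y could be built like this, and new documents classified with the same vectorizer:

import os

# hypothetical layout: data/spam/0001.txt, data/ham/0002.txt, ...
y = [os.path.basename(os.path.dirname(f)) for f in list_of_filenames]

# classify unseen documents with transform (not fit_transform), so they
# are mapped into the same feature space the classifier was trained on
X_new = vect.transform(new_filenames)
predicted = nb.predict(X_new)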

If it turns out this doesn't work because the set of documents is too large (unlikely, since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
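
A minimal sketch of that out-of-core pattern, assuming a hypothetical minibatch generator iter_minibatches and a known label set all_classes (note the non-negativity flag: recent scikit-learn spells it alternate_sign=False, while 0.14 used non_negative=True):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# stateless vectorizer: no fitting pass over the corpus is needed;
# alternate_sign=False keeps feature values non-negative for MultinomialNB
vect = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
nb = MultinomialNB()

all_classes = ['pos', 'neg']  # partial_fit needs the full label set on the first call
for texts, labels in iter_minibatches(batch_size=1000):  # hypothetical generator
    X_batch = vect.transform(texts)
    nb.partial_fit(X_batch, labels, classes=all_classes)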

(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.

Fred Foo