
I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, where each entry has 150k features. I've tried to implement the code from the following link: Implementing Bag-of-Words Naive-Bayes classifier in NLTK

The problem (as I understand it) is that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict as labels):

Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found

My question is: is there a way to avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently here instead of dok_matrix? Or am I missing something obvious?

Thanks for anyone's time. :)

EDIT, Sep 6:

Found the iterkeys, so at least the code runs. It's still too slow: it has taken several hours on a dataset of size 32k and still hasn't finished. Here's what I've got at the moment:

import numpy as np
from collections import OrderedDict
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()

#collect the data into the matrix

pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

# pair each training row (as a dok_matrix) with its label
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets - foldsize)]

classif.train(add_label(matr[:(lentweets - foldsize), 0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets - foldsize)]
data = np.array(classif.batch_classify(readrow))

The problem might be that each row that is extracted doesn't exploit the sparseness of the vector, but instead iterates over all 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
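
To illustrate what I mean by exploiting sparseness, a sketch (untested): converting the matrix to CSR once and slicing rows from that should only touch the stored nonzeros, instead of calling getrow on the dok_matrix for every row.

csr = matr.tocsr()   # one-time conversion; CSR stores each row contiguously
row = csr[5]         # slicing a CSR row touches only its stored nonzeros
print row.nnz        # number of stored (nonzero) entries in that row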

  • Perhaps you can encode your features more efficiently, or reduce their size? – piokuc Aug 31 '12 at 21:01
  • True, but whatever the number of features, I'm afraid I'll still need to manage the size of the matrix. The dataset consists of the words of tweets. – user1638859 Aug 31 '12 at 21:09
  • Found the iterkeys at least; now the problem is that the code is too slow. – user1638859 Sep 06 '12 at 07:31
  • Do you need to do it in Python? Have a look at MALLET: http://mallet.cs.umass.edu/, it's pretty fast. – piokuc Sep 06 '12 at 08:31
  • No need for Python per se, but we have people here familiar with it. Thanks, I'll check that out. Still, I suppose it would be nice to get a definitive solution for large datasets, so that anyone googling this problem will have an answer here. – user1638859 Sep 07 '12 at 11:29
  • Sure. BTW, there is an NLTK interface to MALLET; google for it. I've never used it. MALLET is easy to use as a command-line tool: you prepare the input (text) in files, use a command-line tool to import the data into an internal MALLET format, then run MALLET itself with suitable options and get the results in a text format. But I guess the Python interface is also useful. – piokuc Sep 07 '12 at 13:33
  • It looks like you're dealing with "tweet-length" documents here, have you seen [libshorttext](http://www.csie.ntu.edu.tw/~cjlin/libshorttext/) yet? I just started using it to do classification on a corpus of ~10million tweet sized documents, and it's super fast and accurate (I'm getting 80-90% accuracy with 6 categories and a training set of about 400 documents). And it's written in Python/C as a bonus! EDIT: and, I just realized this thread is almost a year old – sbrother May 15 '13 at 02:30
  • This link might be helpful: [Text Classification and Feature Hashing](http://blog.newsle.com/2013/02/01/text-classification-and-feature-hashing-sparse-matrix-vector-multiplication-in-cython/) – ely Oct 09 '13 at 17:31

1 Answer


Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets.(*)

If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:

from sklearn.feature_extraction.text import TfidfVectorizer

# input='filename' makes the vectorizer read each document from disk
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)

You now have a training set X in the CSR sparse matrix format, which you can feed to a Naive Bayes classifier if you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
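
For instance, if the class were encoded in each file's parent directory (a hypothetical layout; adapt this to your own naming scheme), y could be built like this, and new documents classified with the same vectorizer:

import os

# hypothetical layout: data/spam/0001.txt, data/ham/0002.txt, ...
y = [os.path.basename(os.path.dirname(f)) for f in list_of_filenames]

# classify unseen documents with transform (not fit_transform), so they
# are mapped into the same feature space the classifier was trained on
X_new = vect.transform(new_filenames)
predicted = nb.predict(X_new)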

If it turns out this doesn't work because the set of documents is too large (unlikely, since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
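
A minimal sketch of that out-of-core pattern, assuming a hypothetical minibatch generator iter_minibatches and a known label set all_classes (note the non-negativity flag: recent scikit-learn spells it alternate_sign=False, while 0.14 used non_negative=True):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# stateless vectorizer: no fitting pass over the corpus is needed;
# alternate_sign=False keeps feature values non-negative for MultinomialNB
vect = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
nb = MultinomialNB()

all_classes = ['pos', 'neg']  # partial_fit needs the full label set on the first call
for texts, labels in iter_minibatches(batch_size=1000):  # hypothetical generator
    X_batch = vect.transform(texts)
    nb.partial_fit(X_batch, labels, classes=all_classes)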

(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.

Fred Foo