Using scikit-learn classifier inside nltk, multiclass case

Question

Classifying text documents is simple with scikit-learn, but NLTK has no clean support for it, and the existing samples do it the hard way, like this. I want to preprocess with NLTK and classify with scikit-learn. I found SklearnClassifier in NLTK, but there is a small problem.

In scikit-learn everything is OK:

from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

X_train = [[0, 0], [0, 1], [1, 1]]
y_train = [('first',), ('second',), ('first', 'second')]

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)
print(clf.classes_)

The result is ['first' 'second'], which is what I expect. But when I try to use the same code in NLTK:

from nltk.classify import SklearnClassifier

X_train = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]
clf = SklearnClassifier(OneVsRestClassifier(MultinomialNB()))
clf.train(zip(X_train, y_train))
print(clf.labels())

The result is [('first',), ('second',), ('first', 'second')], which is not the ['first', 'second'] I expect. Is there a solution?

Fred Foo · Accepted Answer · 2012-11-22T16:57:27.473

The NLTK wrapper for scikit-learn doesn't know about multilabel classification, and it shouldn't because it doesn't implement MultiClassifierI. Implementing that would require a separate class.
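To illustrate what such a separate class might look like, here is a minimal sketch. The class name MultiLabelSklearnClassifier and its internals are hypothetical and not part of NLTK. It assumes a scikit-learn version that ships MultiLabelBinarizer, which turns the label tuples into an indicator matrix before fitting.

```python
from nltk.classify.api import MultiClassifierI
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelSklearnClassifier(MultiClassifierI):
    """Hypothetical multilabel wrapper; NOT part of NLTK."""

    def __init__(self, estimator):
        self._clf = estimator
        self._vectorizer = DictVectorizer()
        self._binarizer = MultiLabelBinarizer()

    def train(self, labeled_featuresets):
        # labeled_featuresets: iterable of (feature dict, iterable of labels)
        featuresets, label_sets = zip(*labeled_featuresets)
        X = self._vectorizer.fit_transform(list(featuresets))
        Y = self._binarizer.fit_transform(list(label_sets))
        self._clf.fit(X, Y)
        return self

    def labels(self):
        # Individual labels, not the label tuples seen during training.
        return list(self._binarizer.classes_)

    def classify_many(self, featuresets):
        X = self._vectorizer.transform(list(featuresets))
        Y = self._clf.predict(X)
        # MultiClassifierI expects a set of labels per sample.
        return [set(labels) for labels in self._binarizer.inverse_transform(Y)]

    def classify(self, featureset):
        return self.classify_many([featureset])[0]


X_train = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]

clf = MultiLabelSklearnClassifier(OneVsRestClassifier(MultinomialNB()))
clf.train(zip(X_train, y_train))
known_labels = clf.labels()
prediction = clf.classify({'a': 1})
```

Here labels() reports the individual labels rather than the tuples, and classify returns a set because a sample may carry any number of labels.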

You can either implement the missing functionality, or use scikit-learn without the wrapper. Newer versions of scikit-learn have a DictVectorizer that accepts roughly the same inputs that the NLTK wrapper accepts:

from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

X_train_raw = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train = [('first',), ('second',), ('first', 'second')]

v = DictVectorizer()
X_train = v.fit_transform(X_train_raw)

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train)

You can then use X_test = v.transform(X_test_raw) to transform test samples to matrices. A sklearn.pipeline.Pipeline makes this easier by tying a vectorizer and a classifier together in a single object.
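As a rough sketch of the Pipeline approach, the vectorizer and the classifier can be combined as follows. The step names 'vectorizer' and 'classifier' and the test samples are made up for illustration. Note that recent scikit-learn releases no longer accept a list of label tuples as y, so this sketch assumes a version that provides MultiLabelBinarizer and uses it to convert the labels to an indicator matrix first.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X_train_raw = [{'a': 1}, {'b': 1}, {'c': 1}]
y_train_raw = [('first',), ('second',), ('first', 'second')]

# Convert the label tuples to a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train_raw)

# The pipeline vectorizes the feature dicts and then fits the classifier.
pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('classifier', OneVsRestClassifier(MultinomialNB())),
])
pipeline.fit(X_train_raw, Y_train)

# Illustrative test samples; predict() vectorizes them with the fitted vocabulary.
X_test_raw = [{'a': 1}, {'c': 1}]
Y_pred = pipeline.predict(X_test_raw)

# Map the indicator rows back to tuples of label names.
predicted_labels = mlb.inverse_transform(Y_pred)
```

With this arrangement the raw feature dicts go straight into fit and predict, so there is no separate v.transform call to remember for test data.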

Disclaimer: according to the FAQ, I should disclose my affiliation. I wrote both DictVectorizer and the NLTK wrapper for scikit-learn.
