17

Given this simple example of multilabel classification (taken from this question: Use scikit-learn to classify into multiple categories):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["london"],["london"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]


lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)


print "Accuracy Score: ",accuracy_score(Y_test, predicted)

The code runs fine and prints the accuracy score. However, if I change y_test_text to

y_test_text = [["new york"],["london"],["england"],["london"],["new york", "london"],["new york", "london"],["new york", "london"]]

I get

Traceback (most recent call last):
  File "/Users/scottstewart/Documents/scikittest/example.py", line 52, in <module>
    print "Accuracy Score: ",accuracy_score(Y_test, predicted)
  File "/Library/Python/2.7/site-packages/sklearn/metrics/classification.py", line 181, in accuracy_score
    differing_labels = count_nonzero(y_true - y_pred, axis=1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 393, in __sub__
    raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

Notice the introduction of the 'england' label, which is not in the training set. How do I use multilabel classification so that if a "test" label is introduced, I can still run some sort of metrics? Or is that even possible?

EDIT: Thanks for the answers, guys. I guess my question is more about how the scikit binarizer works or should work. Given my short sample code, I would also expect that if I changed y_test_text to

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]

it would work. I mean, we have fitted for that label, but in this case I get

ValueError: Can't handle mix of binary and multilabel-indicator
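For illustration, here is a minimal sketch (reusing only MultiLabelBinarizer, with the same seven-sample test set) of what produces this mismatch: refitting the binarizer on test labels that only ever contain "new york" learns a single class, so its output has one column, while the classifier was trained against, and predicts, a two-column indicator matrix.

from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
# refitting on test labels that only contain "new york" learns a single class
Y_test = lb.fit_transform([["new york"]] * 7)
print(Y_test.shape)   # (7, 1) -> accuracy_score sees this as a binary target

# the classifier above was fitted on a two-column indicator matrix,
# so classifier.predict(X_test) has shape (7, 2), a multilabel indicator,
# hence "Can't handle mix of binary and multilabel-indicator"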
Scott Stewart

3 Answers

15

You can, if you "introduce" the new label in the training y set too, like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

X_train = np.array(["new york is a hell of a town",
                "new york was originally dutch",
                "the big apple is great",
                "new york is also called the big apple",
                "nyc is nice",
                "people abbreviate new york city as nyc",
                "the capital of great britain is london",
                "london is in the uk",
                "london is in england",
                "london is in great britain",
                "it rains a lot in london",
                "london hosts the british museum",
                "new york is great and so is london",
                "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],    
                ["new york"],["new york"],["london"],["london"],         
                ["london"],["london"],["london"],["london"],
                ["new york","England"],["new york","london"]]

X_test = np.array(['nice day in nyc',
               'welcome to london',
               'london is rainy',
               'it is raining in britian',
               'it is raining in britian and the big apple',
               'it is raining in britian and nyc',
               'hello welcome to new york. enjoy it here and london too'])

y_test_text = [["new york"],["new york"],["new york"],["new york"],["new york"],["new york"],["new york"]]


lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))
Y = lb.fit_transform(y_train_text)
Y_test = lb.fit_transform(y_test_text)

print Y_test

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
print predicted

print "Accuracy Score: ",accuracy_score(Y_test, predicted)

Output:

Accuracy Score:  0.571428571429

The key section is:

y_train_text = [["new york"],["new york"],["new york"],
                ["new york"],["new york"],["new york"],
                ["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","England"],
                ["new york","london"]]

Where we inserted "England" too. It makes sense, because how else could the classifier predict a label it has never seen before? This way we created a three-label classification problem.

EDITED:

lb = preprocessing.MultiLabelBinarizer(classes=("new york","london","England"))

You have to pass the classes as an argument to MultiLabelBinarizer(), and then it will work with any y_test_text.
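For example, a minimal sketch (with made-up toy labels, not the training data above) of the effect: once the class list is fixed up front, every transform produces the same three columns whether or not a given label actually occurs.

from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer(classes=("new york", "london", "England"))
Y = lb.fit_transform([["new york"], ["london"], ["new york", "England"]])
Y_test = lb.transform([["london"], ["new york", "london"]])

print(Y.shape)        # (3, 3) - columns follow the order given in classes
print(Y_test.shape)   # (2, 3) - same width, so accuracy_score can compare it with predictions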

Geeocode
  • Great answer. A couple of recommendations: sklearn.metrics.accuracy_score() for multilabel classification computes a subset accuracy (meaning it requires an exact match of the whole label set), whereas hamming_loss computes the loss with respect to the individual labels that were predicted. [Consistent Multilabel Classification](https://papers.nips.cc/paper/5883-consistent-multilabel-classification.pdf) – Pramit Jun 21 '16 at 19:03
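To make the difference concrete, a small sketch with made-up indicator matrices (not the question's data):

import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

Y_true = np.array([[1, 0], [1, 1], [0, 1]])   # rows = samples, columns = labels
Y_pred = np.array([[1, 0], [1, 0], [0, 1]])   # the second sample misses one label

print(accuracy_score(Y_true, Y_pred))   # 0.666..., only rows that match exactly count
print(hamming_loss(Y_true, Y_pred))     # 0.166..., fraction of individual label slots that are wrong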
4

In short, it is an ill-posed problem. Classification assumes that all labels are known in advance, and so does the binarizer. Fit it on all labels, and then train on any subset you want.
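A minimal sketch of that idea with toy labels of my own (not the question's exact data): fit the binarizer once on every label you ever expect to see, then only call transform afterwards, so train and test share the same columns.

from sklearn.preprocessing import MultiLabelBinarizer

y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_text = [["new york"], ["england"], ["london"]]

lb = MultiLabelBinarizer()
lb.fit(y_train_text + y_test_text)   # fit on the union of all known labels

Y = lb.transform(y_train_text)       # both matrices get one column per known label
Y_test = lb.transform(y_test_text)
print(lb.classes_)                   # ['england' 'london' 'new york']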

lejlot
  • I think the inconvenience is that one might prefer MultiLabelBinarizer to simply ignore any labels it hasn't seen, rather than error. Compare with the behavior of CountVectorizer: if during its transform() method it sees tokens it didn't see during fit(), it will silently ignore them. This is often what you would want when, for example, transforming your test set using the same vectorizer you used to transform your training set. Similarly, when you use MultiLabelBinarizer to transform your test labels, you might want it to silently ignore anything you didn't see in training. – Stephen Aug 16 '17 at 03:22
  • This issue is more likely to come up when you are training a multilabel classifier with a very large number of labels, and especially when you're working with a subset of your data set during development. To work around the issue, I just manually clean up the labels in advance. – Stephen Aug 16 '17 at 03:23
  • I had a similar issue here: https://stats.stackexchange.com/questions/298046/achieving-consistency-between-training-test-target-representations-in-multilabe – Stephen Aug 16 '17 at 03:24
0

As mentioned in another comment, I would personally expect the binarizer to ignore unseen classes at transform time. The classifier consuming the binarizer's output might not react well anyway if the features presented by the test samples are different from what was used in training.

I addressed the issue by simply removing the unseen classes from the samples. I think this is a safer approach than dynamically changing the fitted binarizer or (another option) extending it to allow ignoring unseen labels.

list(map(lambda names: np.intersect1d(lb.classes_, names), y_test_text))

(I didn't run this with your actual code.)
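A minimal sketch of that workaround in context, again with toy labels of my own: drop any test label the fitted binarizer has never seen, then transform as usual.

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

y_train_text = [["new york"], ["london"], ["new york", "london"]]
y_test_text = [["new york"], ["england"], ["london", "england"]]   # "england" was never trained on

lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# keep only the labels the binarizer already knows about
y_test_clean = [list(np.intersect1d(lb.classes_, names)) for names in y_test_text]
Y_test = lb.transform(y_test_clean)  # same two columns as Y; rows with no known label stay all zeros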

javigzz