82

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match.

For example I have a piece of text:

"Theaters in New York compared to those in London"

And I have trained the algorithm to pick a place for every text snippet I feed it.

In the above example I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even return the label with the next highest probability?

Thanks for your help.

---Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using:

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

petezurich
CodeMonkeyB

5 Answers

112

What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html

I'm not sure what's going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a question of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to hold a set of labels per sample: a list of tuples/lists of labels in older releases, or a binary indicator matrix in newer ones.

The following works for me:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
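# One row per sample, one column per label (column 0: New York, column 1: London).
# Older releases also accepted a list of label lists such as [[0], [1], [0, 1]].
y_train = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0],
                    [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1],
                    [1, 1], [1, 1]])
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

classifier = Pipeline([
    # ngram_range=(1, 2) replaces the min_n=1, max_n=2 arguments of older releases
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)  # binary indicator matrix, one row per text
for item, labels in zip(X_test, predicted):
    # np.flatnonzero gives the column indices of the labels predicted for this row
    print('%s => %s' % (item, ', '.join(target_names[i] for i in np.flatnonzero(labels))))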

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London
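
The question also asks about getting the label with the next highest probability. Below is a minimal sketch of one way to do that, not part of the original answer. It swaps in LogisticRegression as the base estimator, because LinearSVC does not provide predict_proba. The shortened training set, the helper name `ranked_labels`, and its `threshold` parameter are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy data with single-label examples only; one binary column per city.
X_train = ["new york is a hell of a town",
           "the big apple is great",
           "nyc is nice",
           "london is in the uk",
           "it rains a lot in london",
           "london hosts the british museum"]
target_names = ["New York", "London"]
Y_train = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

# LogisticRegression is an assumption here: it exposes predict_proba.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
model.fit(X_train, Y_train)


def ranked_labels(texts, threshold=0.5):
    """Hypothetical helper: per text, (label, probability) pairs, best first."""
    proba = model.predict_proba(texts)  # shape: (n_texts, n_labels)
    results = []
    for row in proba:
        order = np.argsort(row)[::-1]  # indices from highest to lowest probability
        results.append([(target_names[i], float(row[i]))
                        for i in order if row[i] >= threshold])
    return results


# threshold=0.0 keeps every label, so the full ranking is visible.
ranked = ranked_labels(["nice day in nyc", "london and new york"], threshold=0.0)
for labels in ranked:
    print(labels)
```

In the multi-label case each column is an independent per-label probability, so a row does not have to sum to 1. Whether a given threshold works well depends on the data.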
starball
mwv
  • Hi, I tried using the example from multiclass but I still can't get multiple labels. I have updated my question with my code sample. What am I doing wrong? Thanks! – CodeMonkeyB May 11 '12 at 01:36
  • Hi, thanks for your example. The problem is that I have training data for thousands of cities (with possible spellings, names, etc.). To make it work as in the example, I would have to create training sets for combinations of all possible city names. Is it possible instead to somehow get the probabilities of the labels? For example, for the last text, the probability that it's New York is X% and the probability that it's London is Y%, sorted by probability. Then, based on some sort of threshold, I could grab the labels with a certain probability or higher? – CodeMonkeyB May 13 '12 at 19:02
  • I don't think you need training data with all combinations of city names, I just added in the last two examples in `X_train` to make it more clear what `y_train` should look like. Under the hood, `OneVsRestClassifier` trains a separate classifier for each class, so you should be able to get the same results without training examples that combine city names. You can get the probability that a datapoint belongs to a class by calling `predict_proba` on a fitted classifier. This only works for appropriate classifiers. – mwv May 13 '12 at 20:25
  • 1
    I tried removing the last two training examples, which combine the city names, and I get: "hello welcome to new york. enjoy it here and london too => New York". It no longer returns two labels. For me it's only returning two labels if I train on the combinations of the two cities. Am I missing something? Thanks again for all your help – CodeMonkeyB May 14 '12 at 01:59
  • 1
    This is just a toy dataset, I wouldn't draw too many conclusions from that. Have you tried this procedure on your real data? – mwv May 14 '12 at 07:42
  • I tried this with my actual dataset and had the same results. I played around with the test dataset some more. It seems like this way only works if there are combinations of labels present in the training set. – CodeMonkeyB May 15 '12 at 02:10
  • 3
    @CodeMonkeyB: you should really accept this answer, it's correct from a programming point of view. Whether it works in practice depends on your data, not the code. – Fred Foo Nov 21 '12 at 11:00
  • Hi, thanks for this answer. I would add these lines. just for completeness import numpy as np from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from sklearn.feature_extraction import text from sklearn.pipeline import Pipeline from sklearn.multiclass import OneVsRestClassifier from sklearn.svm import LinearSVC and target_names = {} target_names[0] = "NYC" target_names[1] = "London" – Jonathan Hendler Apr 01 '13 at 07:01
  • @MWv, http://scikit-learn.org/dev/modules/multiclass.html. says the NB is inherently multiclass and you don't need these metaclassifier. Some people are discussing that through the probability assigned to tags you can find multiple label, just pick top 3 probability. Did you know how to do this? – David Dec 12 '13 at 04:26
  • scikit 0.15 says "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation." Maybe update this answer (see post of [J Maurer](http://stackoverflow.com/a/19172087/599739) for an example) – klamann Mar 27 '15 at 09:06
  • 3
    Is anyone else getting an issue with `min_n` and `max_n`. I need to change them to `ngram_range=(1,2)` to work – emmagras Jun 12 '15 at 15:03
  • @Cerin can you help me get this working? I had to use @emmagras 's fix, but whenever I try to run it I get `TypeError: 'str' object cannot be interpreted as an integer` – Rich Jun 11 '16 at 08:06
  • 1
    It is giving this error: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. – MANU Sep 12 '16 at 06:25
  • thanks a lot for this post and for the answer. i am wondering if such a classifier will detect new values of y? for example, if i put in berlin instead of new york or london, will it be picked up form the context? – AbtPst Jan 25 '17 at 14:43
  • 1
    How should I cross validate the multi-label classification? There will be many cases to check if the prediction correct. Let's say it will be correct if all the labels are correct, or if some of the labels are correct. – Light Feb 17 '17 at 03:59
  • 1
    this approach does not work any more ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. – sariii Sep 07 '18 at 00:36
61

EDIT: Updated for Python 3, scikit-learn 0.18.1 using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as the input rather than binary labels and encodes them using MultiLabelBinarizer.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
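
The output above lists "london, new york" rather than the order used in `target_names`. A small sketch of why, assuming only the standard MultiLabelBinarizer behaviour: the binarizer sorts the label names it sees, and inverse_transform decodes columns in that sorted order, so `target_names` plays no part in the decoding.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_text = [["new york"], ["london"], ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_text)

# Column order follows the sorted label names, not the order of first appearance.
print(mlb.classes_)

# One row per sample, one column per label in mlb.classes_.
print(Y)

# Decoding walks the columns in that same sorted order.
decoded = mlb.inverse_transform(Y)
print(decoded)
```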
J Maurer
  • 13
    `labelBinarizer` is outdated. Use `lb = preprocessing.MultiLabelBinarizer()` instead – Roman Mar 04 '16 at 10:27
  • I have two questions: 1. Why does `it is raining in britian and the big apple ` only give New York, and not Britian too? And second, how should I convert this code to python 3? It runs fine in my server (python 2.7) but keeps giving me errors on my home PC (python3). – Rich Jun 11 '16 at 08:27
  • 1
    It doesn't give Britain because the only output labels are `New York` and `London`. – umop aplsdn Jul 07 '16 at 00:44
  • 2
    According to [scikit-learn](http://scikit-learn.org/dev/modules/multiclass.html#multiclass-and-multilabel-algorithms) One-Vs-All is supported by all linear models except sklearn.svm.SVC and also multilabel is supported by: Decision Trees, Random Forests, Nearest Neighbors, so I wouldn't use LinearSVC() for this type of task (a.k.a multilabel classification which I assume you want to use) – PeterB Mar 13 '17 at 16:40
  • 2
    Fyi One-Vs-All that @mindstorm mentions, corresponds to scikit-learn class "OneVsRestClassifier" (notice "Rest" rather than "all"). [This scikit-learn help page](http://scikit-learn.org/dev/modules/multiclass.html#multiclass-and-multilabel-algorithms) clarifies it. – lucid_dreamer Jun 08 '17 at 12:36
  • 1
    As @mindstorm mentions, It is true that at [this page](http://scikit-learn.org/dev/modules/multiclass.html#multiclass-and-multilabel-algorithms), the documentation mentions: "One-Vs-All: all linear models except sklearn.svm.SVC". However [another multilabel example from the scikit-learn documentation](http://scikit-learn.org/dev/auto_examples/plot_multilabel.html) shows a multilabel example with this line `classif = OneVsRestClassifier(SVC(kernel='linear'))`. Puzzled. – lucid_dreamer Jun 08 '17 at 12:44
  • Sorry for my confusing previous comment, it is partially wrong because I was confused as well. Actually you can use OneVsRestClassifier with whatever binary classifier you want (correct me if I am wrong). Documentation only says which scikit learn classifiers support One-Vs-Rest or multilabel classification by design e. g. they don't need any meta-estimator (like OneVsRestClassifier) for doing such task. It is perfectly correct to use LinearSVC with meta-estimator (OneVsRestClassifier) for doing multilabel classification if it provides sufficient predictions for your kind of data. – PeterB Jun 09 '17 at 14:47
  • 1
    Is there a way to have it handle no labels? For example, if I add a test example of "i want cookies" it labels it as both "New York" and "London" – Omar Meky Jun 30 '17 at 14:52
  • @OmarMeky you may be able to modify x_train and y_train_text to include a category of text a label of none – J Maurer Sep 25 '17 at 16:59
8

I just ran into this as well, and the problem for me was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides between multi-class and multi-label based on the format of the input labels. So change:

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this will disappear in the future, since it breaks if all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
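
As a rough way to see how scikit-learn reads a given target format, here is a minimal sketch using `type_of_target` from `sklearn.utils.multiclass`. It is not part of the original answer, and it uses the binary indicator matrix that newer releases expect rather than the sequence-of-sequences form shown above.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A flat sequence of strings: one label per sample (single-label problem).
flat = ["New York", "London"]

# A binary indicator matrix: one column per label, several labels per row allowed.
indicator = np.array([[1, 0], [0, 1], [1, 1]])

flat_kind = type_of_target(flat)
indicator_kind = type_of_target(indicator)

print(flat_kind)       # two distinct values in a flat sequence count as 'binary'
print(indicator_kind)  # a 2-D 0/1 matrix counts as 'multilabel-indicator'
```

Only the second form tells a classifier such as OneVsRestClassifier that a sample may carry more than one label.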

user2824135
8

Change this line to make it work with newer versions of scikit-learn:

from sklearn import preprocessing

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
Serjik
Srini Sydney
2

A few examples of encoding multi-class labels with LabelBinarizer follow.

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# Integer class labels 1..14, with label 1 repeated as the last sample
labels = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1])
transformed_label = encoder.fit_transform(labels)  # one one-hot row per sample
print(transformed_label)

Output is

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# String class labels; columns follow the sorted class names
labels = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
transformed_label = encoder.fit_transform(labels)
print(transformed_label)

Output is

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
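
Both examples above give each sample exactly one label, so every row has a single 1. For the multi-label case in the question, a sample may have several labels or none. Here is a minimal sketch of the contrast using MultiLabelBinarizer, with made-up label sets built from the same animal names:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample carries a list of labels; the last sample has no label at all.
label_sets = [["Leopard"], ["Lion", "Tiger"], []]

encoder = MultiLabelBinarizer()
multi_hot = encoder.fit_transform(label_sets)

print(encoder.classes_)  # columns, in sorted order
print(multi_hot)         # rows may contain several 1s, or none
```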
Goyal Vicky