Can I explicitly set the list of possible classes for an sklearn SVM?

Question

I have a program that uses the SVC class from sklearn. Really, I'm using the OneVsRestClassifier class, which uses the SVC class. My problem is that the predict_proba() method sometimes returns a vector that's too short. This is because the classes_ attribute is missing a class, which happens when a label isn't present during training.

Consider the following example (code shown below). Suppose all possible classes are 1, 2, 3, and 4. Now suppose training data just happens to not contain any data labeled with class 3. This is fine, except when I call predict_proba() I want a vector of length 4. Instead, I get a vector of length 3. That is, predict_proba() returns [p(1) p(2) p(4)], but I want [p(1) p(2) p(3) p(4)], where p(3) = 0.

I guess clf.classes_ is implicitly defined by the labels seen during training, which is incomplete in this case. Is there any way I can explicitly set the possible class labels? I know a simple workaround is to just take the predict_proba() output and manually create the array I want. However, this is inconvenient and might slow my program down quite a bit.

# Python 2.7.6

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10
X_test = [[1]]

clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)

# calling predict_proba() gives: [p(1) p(2) p(4)]
# I want: [p(1) p(2) p(3) p(4)], where p(3) = 0
print clf.predict_proba(X_test)

The work-around I had in mind creates a new list of probabilities and builds it one element at a time with multiple append() calls (see code below). This seems like it would be slow compared to having predict_proba() return what I want automatically. I don't know yet if it will significantly slow my program because I haven't tried it yet. Regardless, I wanted to know if there was a better way.

def workAround(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba (but 1D)
    classes_: clf.classes_
    all_classes: all possible classes; superset of classes_
    """
    all_probs = []
    i = 0  # index into probs and classes_

    for cls in all_classes:
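        # The bounds check avoids an IndexError once every entry of classes_ is consumed
        if i < len(classes_) and cls == classes_[i]: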
            all_probs.append(probs[i])
            i += 1
        else:
            all_probs.append(0.0)

    return np.asarray(all_probs)

If your task is multi-class and not multi-label, you don't need to use OneVsRestClassifier. Why do you think adding the additional column to the return might slow down your program? There is no automatic way, and I think we decided against adding one at some point, as it would add an additional argument to all classifiers and clutter the API. — Andreas Mueller, May 04 '15 at 18:09
I think it would be slow because I would be making a copy of the probabilities for every call to predict_proba(). The new columns must be inserted into the correct place to preserve the sorted order of the classes. I'll get rid of OneVsRestClassifier and use just SVC, thanks. — Josh Kelle, May 04 '15 at 21:06
Well if scikit-learn would do it for you it would need to make the copy, too ;) And the cost of copying the array is negligible compared to making the prediction. You can get the place where your class needs to be entered from the classes_ attribute and probably np.searchsorted — Andreas Mueller, May 05 '15 at 13:31

Franck Dernoncourt · Answer 1 · 2015-08-24T23:26:19.810

As said in the comments, scikit-learn provides no way to explicitly set the possible class labels.

I NumPyfied your workaround:

import sklearn
import sklearn.svm
import sklearn.metrics  # used by the scoring example further down
import numpy as np
np.random.seed(3) # for reproducibility

def predict_proba_ordered(probs, classes_, all_classes):
    """
    probs: 2D array of probabilities, output of predict_proba
    classes_: clf.classes_
    all_classes: all possible classes (superset of classes_)
    """
    # np.float was removed in recent NumPy releases; the builtin float is equivalent here
    proba_ordered = np.zeros((probs.shape[0], all_classes.size), dtype=float)
    sorter = np.argsort(all_classes) # http://stackoverflow.com/a/32191125/395857
    idx = sorter[np.searchsorted(all_classes, classes_, sorter=sorter)]
    proba_ordered[:, idx] = probs
    return proba_ordered

# Prepare the data set
all_classes = np.array([1,2,3,4]) # explicitly set the possible class labels.
X_train = [[1], [2], [4]] * 3
print('X_train: {0}'.format(X_train))
y = [1, 2, 4] * 3 # Label 3 is missing.
print('y: {0}'.format(y))
X_test = [[1], [2], [3]]
print('X_test: {0}'.format(X_test))

# Train
clf = sklearn.svm.SVC(probability=True, kernel="linear")
clf.fit(X_train, y)
print('clf.classes_: {0}'.format(clf.classes_))

# Predict
probs = clf.predict_proba(X_test)  # As label 3 isn't in the training set, probs has 3 columns, not 4
proba_ordered = predict_proba_ordered(probs, clf.classes_, all_classes)
print('proba_ordered: {0}'.format(proba_ordered))

Output:

X_train: [[1], [2], [4], [1], [2], [4], [1], [2], [4]]
y: [1, 2, 4, 1, 2, 4, 1, 2, 4]
X_test: [[1], [2], [3]]
clf.classes_: [1 2 4]
proba_ordered: [[ 0.81499201  0.08640176  0.          0.09860622]
                [ 0.21105955  0.63893181  0.          0.15000863]
                [ 0.08965731  0.49640147  0.          0.41394122]]
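To make the index computation in predict_proba_ordered easier to follow, here is a minimal NumPy-only sketch. The probabilities and the deliberately unsorted all_classes are made-up values for illustration, not output from a real classifier.

```python
import numpy as np

# Hypothetical values for illustration only.
all_classes = np.array([4, 1, 3, 2])   # full label set, deliberately unsorted
classes_ = np.array([1, 2, 4])         # labels seen in training (sorted, like clf.classes_)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])    # made-up predict_proba output, one row per sample

# argsort gives the permutation that would sort all_classes.
sorter = np.argsort(all_classes)
# searchsorted finds each trained label's position in the sorted view;
# indexing sorter with that result maps back to positions in the unsorted all_classes.
idx = sorter[np.searchsorted(all_classes, classes_, sorter=sorter)]

# Columns for labels missing from training stay at zero.
padded = np.zeros((probs.shape[0], all_classes.size), dtype=float)
padded[:, idx] = probs
print(idx)
print(padded)
```

The columns of padded follow the order of all_classes, so the sorter step only matters when all_classes is not already sorted. Each row still sums to 1 because the added columns are all zero.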

Note that you can explicitly set the possible class labels in sklearn.metrics (e.g. sklearn.metrics.f1_score) using the labels parameter:

labels : array
Integer array of labels.

Example:

# Score
y_pred = clf.predict(X_test)
y_true = np.array([1,2,3])
precision = sklearn.metrics.precision_score(y_true, y_pred, labels=all_classes, average=None)
print('precision: {0}'.format(precision))
recall = sklearn.metrics.recall_score(y_true, y_pred, labels=all_classes, average=None)
print('recall: {0}'.format(recall))
f1_score = sklearn.metrics.f1_score(y_true, y_pred, labels=all_classes, average=None)
print('f1_score: {0}'.format(f1_score))

Note that as of now you'll run into an issue if you try to use sklearn.metrics.roc_auc_score() when no positive example is in the ground truth for a given label.
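One possible way around this is to work out in advance which labels can be scored at all, since a one-vs-rest ROC AUC is only defined for a label when the ground truth contains both positive and negative examples of it. The helper below is a hypothetical sketch (labels_with_defined_auc is not a scikit-learn function) and uses only NumPy:

```python
import numpy as np

def labels_with_defined_auc(y_true, all_classes):
    """Hypothetical helper: return the labels for which a one-vs-rest ROC AUC
    is defined, i.e. y_true has at least one positive and one negative example."""
    y_true = np.asarray(y_true)
    keep = []
    for label in all_classes:
        is_positive = (y_true == label)
        # Skip labels that never occur, and labels that are the only one present.
        if is_positive.any() and not is_positive.all():
            keep.append(label)
    return np.array(keep)

y_true = np.array([1, 2, 3])
all_classes = np.array([1, 2, 3, 4])
scorable = labels_with_defined_auc(y_true, all_classes)
print(scorable)  # label 4 is dropped because y_true has no positive example of it
```

You could then compute the AUC only for the labels in scorable, using the matching columns of proba_ordered, and report the others as undefined.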
