I have a program that uses the SVC class from sklearn. More precisely, I'm using the OneVsRestClassifier class, which wraps SVC. My problem is that the predict_proba() method sometimes returns a vector that's too short. This is because the classes_ attribute is missing a class, which happens when a label isn't present during training.
Consider the following example (code shown below). Suppose all possible classes are 1, 2, 3, and 4. Now suppose training data just happens to not contain any data labeled with class 3. This is fine, except when I call predict_proba() I want a vector of length 4. Instead, I get a vector of length 3. That is, predict_proba() returns [p(1) p(2) p(4)], but I want [p(1) p(2) p(3) p(4)], where p(3) = 0.
I guess clf.classes_ is implicitly defined by the labels seen during training, which are incomplete in this case. Is there any way I can explicitly set the possible class labels? I know a simple workaround is to take the predict_proba() output and manually build the array I want. However, this is inconvenient and might slow my program down quite a bit.
# Python 2.7.6
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import numpy as np
X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10
X_test = [[1]]
clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)
# calling predict_proba() gives: [p(1) p(2) p(4)]
# I want: [p(1) p(2) p(3) p(4)], where p(3) = 0
print clf.predict_proba(X_test)
The workaround I had in mind creates a new list of probabilities and builds it one element at a time with multiple append() calls (see code below). This seems like it would be slow compared to having predict_proba() return what I want directly. I haven't tried it yet, so I don't know whether it will significantly slow my program. Regardless, I want to know if there is a better way.
def workAround(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba (but 1D)
    classes_: clf.classes_
    all_classes: all possible classes; superset of classes_, in the same order
    """
    all_probs = []
    i = 0  # index into probs and classes_
    for cls in all_classes:
        # Check i first so a class missing from the end of classes_
        # does not raise an IndexError
        if i < len(classes_) and cls == classes_[i]:
            all_probs.append(probs[i])
            i += 1
        else:
            all_probs.append(0.0)
    return np.asarray(all_probs)
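For comparison, the same padding can probably be done without a Python-level loop. The sketch below is only an illustration, not something sklearn provides: expand_proba is a hypothetical helper name, and it assumes all_classes is sorted and that every entry of classes_ appears in it. It uses np.searchsorted to find the column for each trained class and fills the remaining columns with zeros, so it handles the full 2D output of predict_proba at once.

```python
import numpy as np

def expand_proba(probs, classes_, all_classes):
    """Pad predict_proba output with zero columns for unseen classes.

    Hypothetical helper. Assumes all_classes is sorted and is a
    superset of classes_.
    """
    # Accept either a single 1D row or the usual 2D (n_samples, n_classes) array
    probs = np.atleast_2d(np.asarray(probs, dtype=float))
    all_classes = np.asarray(all_classes)
    # Column position of each trained class within the full class list
    cols = np.searchsorted(all_classes, classes_)
    full = np.zeros((probs.shape[0], len(all_classes)))
    full[:, cols] = probs
    return full

# Made-up probabilities for classes [1, 2, 4]; class 3 was never seen in training
example = expand_proba([[0.2, 0.3, 0.5]], [1, 2, 4], [1, 2, 3, 4])
print(example)
```

With a fitted classifier this would be called as expand_proba(clf.predict_proba(X_test), clf.classes_, [1, 2, 3, 4]), and the column for class 3 comes back as 0.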