
I have an SVM classifier (`LinearSVC`) outputting final classifications for every sample in the test set, something like

1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1

and so on.

The "truth" labels is also something like

1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1

I would like to run that SVM with some parameters, generate points for the ROC curve, and calculate the AUC.

I could do this by myself, but I am sure someone did it before me for cases like this.

Unfortunately, everything I can find is for cases where the classifier returns probabilities rather than hard estimates, like here or here.

I thought this would work, but `from sklearn.metrics import plot_roc_curve` is not found!

Is there anything online that fits my case?

Thanks

Gulzar
  • Please print the total error here. – abdoulsn Dec 07 '19 at 15:14
  • [See `roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) on the sklearn website. – abdoulsn Dec 07 '19 at 15:16
  • @abdoulsn I linked to that myself, I don't have `y_score`, I only have boolean results. – Gulzar Dec 07 '19 at 15:20
  • I'm a little confused as to why you think you don't have `y_score`. Those are the classes your model predicts, or the first array in your question. – m13op22 Dec 11 '19 at 15:02

2 Answers


You could get around the problem by using `sklearn.svm.SVC` and setting the `probability` parameter to `True`.

As you can read in the docs:

probability: boolean, optional (default=False)

Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.

As an example (details omitted):

from sklearn.svm import SVC
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

.
.
.

# probability=True enables predict_proba (fitted via internal 5-fold CV)
model = SVC(kernel="linear", probability=True)
model.fit(X_train, y_train)

.
.
.

# Use the signed distance to the hyperplane as the ranking score.
decision_scores = model.decision_function(X_test)
fpr, tpr, thres = roc_curve(y_test, decision_scores)
print('AUC: {:.3f}'.format(roc_auc_score(y_test, decision_scores)))

# roc curve
plt.plot(fpr, tpr, "b", label='Linear SVM')
plt.plot([0,1],[0,1], "k--", label='Random Guess')
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend(loc="best")
plt.title("ROC curve")
plt.show()

and you should get something like this:

(image: ROC curve for the linear SVM against the random-guess diagonal)
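
Since `probability=True` was set, you could equally rank by `predict_proba` (though, per the docs quote above, its output may be slightly inconsistent with `predict`). A minimal sketch, reusing the `X_test` / `y_test` placeholders from the snippet above:

from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class: second column of predict_proba.
proba_scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thres = roc_curve(y_test, proba_scores)
print('AUC: {:.3f}'.format(roc_auc_score(y_test, proba_scores)))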


NOTE that `LinearSVC` is MUCH FASTER than `SVC(kernel="linear")`, especially if the training set is very large or has many features.

sentence

You can use `decision_function` here:

from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_features=4, random_state=0)
clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X, y)

print(clf.predict([[0, 0, 0, 0]]))
#>> [1]
print(clf.decision_function([[0, 0, 0, 0]]))
#>> [ 0.2841757]

The cleanest way would be to use Platt scaling to convert the distance to the hyperplane, as given by `decision_function`, into a probability.
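
For reference, scikit-learn ships Platt scaling as `CalibratedClassifierCV` with `method='sigmoid'`. A minimal sketch wrapping the same `LinearSVC`, reusing `X` and `y` from the snippet above:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Fit a sigmoid (Platt) calibrator on top of LinearSVC's decision values.
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0, tol=1e-5),
                                    method='sigmoid', cv=5)
calibrated.fit(X, y)

# Calibrated probability of the positive class.
print(calibrated.predict_proba([[0, 0, 0, 0]])[:, 1])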

However, quick and dirty:

import math

# Squash the decision values into (0, 1) with a monotone mapping.
[math.tanh(v)/2 + 0.5 for v in clf.decision_function([[0, 0, 0, 0], [1, 1, 1, 1]])]
#>> [0.6383826839666699, 0.9635586809605969]

As Platt scaling preserves the order of the examples (and so does the monotone tanh mapping above), the resulting ROC curve will be consistent.

In addition: Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set `probability=False` and use `decision_function` instead of `predict_proba`.
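
That said, for the ROC curve itself no probabilities are needed at all: `roc_curve` accepts any ranking score, so the raw `decision_function` output can be passed in directly. A minimal sketch reusing `clf`, `X`, and `y` from above:

from sklearn.metrics import roc_curve, roc_auc_score

# Signed distances to the hyperplane act as ranking scores.
scores = clf.decision_function(X)

fpr, tpr, thres = roc_curve(y, scores)
print('AUC: {:.3f}'.format(roc_auc_score(y, scores)))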

CAFEBABE