
I'm using linear_model.LinearRegression from scikit-learn as a predictive model, and fitting it works fine. The problem comes when I try to evaluate the predicted results with the accuracy_score metric.

This is my true data:

array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

My predicted data:

array([ 0.07094605,  0.1994941 ,  0.19270157,  0.13379635,  0.04654469,
    0.09212494,  0.19952108,  0.12884365,  0.15685076, -0.01274453,
    0.32167554,  0.32167554, -0.10023553,  0.09819648, -0.06755516,
    0.25390082,  0.17248324])

My code:

accuracy_score(y_true, y_pred, normalize=False)

Error message:

ValueError: Can't handle mix of binary and continuous target
Arij SEDIRI

8 Answers


Despite the plethora of wrong answers here that attempt to circumvent the error by numerically manipulating the predictions, the root cause of your error is a theoretical and not computational issue: you are trying to use a classification metric (accuracy) in a regression (i.e. numeric prediction) model (LinearRegression), which is meaningless.

Just like the majority of performance metrics, accuracy compares apples to apples (i.e true labels of 0/1 with predictions again of 0/1); so, when you ask the function to compare binary true labels (apples) with continuous predictions (oranges), you get an expected error, where the message tells you exactly what the problem is from a computational point of view:

Classification metrics can't handle a mix of binary and continuous target

Although the message doesn't tell you directly that you are trying to compute a metric that is invalid for your problem (and we shouldn't actually expect it to go that far), it is certainly a good thing that scikit-learn at least gives you a direct and explicit warning that you are attempting something wrong; this is not necessarily the case with other frameworks - see, for example, the behavior of Keras in a very similar situation, where you get no warning at all and one just ends up complaining about low "accuracy" in a regression setting...

I am super-surprised with all the other answers here (including the accepted & highly upvoted one) effectively suggesting to manipulate the predictions in order to simply get rid of the error; it's true that, once we end up with a set of numbers, we can certainly start fiddling with them in various ways (rounding, thresholding, etc.) in order to make our code behave, but this of course does not mean that our numeric manipulations are meaningful in the specific context of the ML problem we are trying to solve.

So, to wrap up: the problem is that you are applying a metric (accuracy) that is inappropriate for your model (LinearRegression): if you are in a classification setting, you should change your model (e.g. use LogisticRegression instead); if you are in a regression (i.e. numeric prediction) setting, you should change the metric. Check the list of metrics available in scikit-learn, where you can confirm that accuracy is used only in classification.
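For instance, in a classification setting the fix is a one-line model swap (a minimal sketch: the feature matrix X here is made up for illustration, while y_true is the binary array from the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy features standing in for the real ones; y_true is binary, as in the question
rng = np.random.RandomState(0)
X = rng.rand(17, 3)
y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

clf = LogisticRegression().fit(X, y_true)
y_pred = clf.predict(X)  # hard 0/1 labels, not continuous values

print(accuracy_score(y_true, y_pred))  # a float in [0, 1] - no ValueError
```

Because `predict()` of a classifier returns labels of the same kind as y_true, accuracy is now comparing apples to apples.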

Compare also the situation with a recent SO question, where the OP is trying to get the accuracy of a list of models:

models = []
models.append(('SVM', svm.SVC()))
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SGDRegressor', linear_model.SGDRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('BayesianRidge', linear_model.BayesianRidge())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('LassoLars', linear_model.LassoLars())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('ARDRegression', linear_model.ARDRegression())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('PassiveAggressiveRegressor', linear_model.PassiveAggressiveRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('TheilSenRegressor', linear_model.TheilSenRegressor())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets
#models.append(('LinearRegression', linear_model.LinearRegression())) #ValueError: Classification metrics can't handle a mix of binary and continuous targets

where the first 6 models work OK, while all the rest (commented-out) ones give the same error. By now, you should be able to convince yourself that all the commented-out models are regression (and not classification) ones, hence the justified error.
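The dichotomy is easy to reproduce in a couple of lines (a sketch with synthetic data; any classifier/regressor pair would do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)
X = rng.rand(20, 2)
y = rng.randint(0, 2, 20)  # binary labels

# Classifier: predictions are 0/1 labels, so accuracy is well-defined
print(accuracy_score(y, LogisticRegression().fit(X, y).predict(X)))

# Regressor: predictions are continuous floats, so accuracy raises the error above
try:
    accuracy_score(y, LinearRegression().fit(X, y).predict(X))
except ValueError as e:
    print(e)  # -> Classification metrics can't handle a mix of binary and continuous targets
```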

A last important note: it may sound legitimate for someone to claim:

OK, but I want to use linear regression and then just round/threshold the outputs, effectively treating the predictions as "probabilities" and thus converting the model into a classifier

Actually, this has already been suggested in several other answers here, implicitly or not; again, this is an invalid approach (and the fact that you have negative predictions should have already alerted you that they cannot be interpreted as probabilities). Andrew Ng, in his popular Machine Learning course at Coursera, explains why this is a bad idea - see his Lecture 6.1 - Logistic Regression | Classification at Youtube (explanation starts at ~ 3:00), as well as section 4.2 Why Not Linear Regression [for classification]? of the (highly recommended and freely available) textbook An Introduction to Statistical Learning by Hastie, Tibshirani and coworkers...
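To see the issue concretely (a sketch with made-up features; the out-of-range values mirror the negative predictions in the question):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(50, 1)
y = (X[:, 0] > 0).astype(int)  # binary labels

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-5.0], [0.0], [5.0]])   # points far from the training range
print(lin.predict(X_new))                  # values below 0 and above 1 - not probabilities
print(log.predict_proba(X_new)[:, 1])      # always within [0, 1], by construction
```

The linear model's outputs escape the [0, 1] interval, so no thresholding trick can turn them into legitimate probabilities, whereas the sigmoid of logistic regression keeps them bounded by design.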

desertnaut
  • I agree; why use linear regression when we have logistic? But, in ISL the second-to-last paragraph of that section (in the seventh printing?), the authors seem to suggest that it actually may not be so bad in the binary classification case: "it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $Pr(\text{drug overdose}\mid X)$ in this special case" and "the classifications...will be the same as for the linear discriminant analysis (LDA) procedure". Any insight there? – Ben Reiniger Jun 02 '20 at 18:10
  • This and one other answer is correct, well explained. – PKumar Aug 01 '20 at 15:08
  • "_OK, but I want to use linear regression and then just round/threshold the outputs, effectively treating the predictions as 'probabilities'..._" Isn't that exactly what a logistic regression is? A linear regression with sigmoid/softmax function converting (possibly negative) logits into probabilities? – Super-intelligent Shade Aug 21 '22 at 20:13
  • @Super-intelligentShade it is most certainly *not* - the cost function to be minimized (cross entropy) is different, too. Please notice that further details on such non-programming issues are off-topic in SO. Should you still have questions or doubts, please consider posting at Stats SE. – desertnaut Aug 21 '22 at 23:29

accuracy_score is a classification metric, you cannot use it for a regression problem.

You can see the available regression metrics in the docs.
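For example, regression metrics such as MAE, MSE, or R² apply directly to the continuous predictions (a sketch reusing the two arrays from the question):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([0.07094605, 0.1994941, 0.19270157, 0.13379635, 0.04654469,
                   0.09212494, 0.19952108, 0.12884365, 0.15685076, -0.01274453,
                   0.32167554, 0.32167554, -0.10023553, 0.09819648, -0.06755516,
                   0.25390082, 0.17248324])

# These treat both inputs as plain numbers, so mixing binary and continuous is fine
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))
```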

Amey Kumar Samala

The problem is that the true y is binary (zeros and ones), while your predictions are not. You probably generated probabilities and not predictions, hence the result :) Try instead to generate class membership, and it should work!

JohnnyQ
  • `LinearRegression` produces numeric predictions, and not probabilities; the issue is due to the attempt to use accuracy in a regression setting, which is meaningless, hence the error... – desertnaut Jan 31 '19 at 00:10

The sklearn.metrics.accuracy_score(y_true, y_pred) method defines y_pred as:

y_pred : 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.

Which means y_pred has to be an array of 1's or 0's (predicted labels). They should not be probabilities.

The predicted labels (1's and 0's) and/or predicted probabilities can be generated using the LinearRegression() model's methods predict() and predict_proba() respectively.

1. Generate predicted labels:

LR = linear_model.LinearRegression()
y_preds=LR.predict(X_test)
print(y_preds)

output:

[1 1 0 1]

y_preds can now be used for the accuracy_score() method: accuracy_score(y_true, y_preds)

2. Generate probabilities for labels:

Some metrics such as 'precision_recall_curve(y_true, probas_pred)' require probabilities, which can be generated as follows:

LR = linear_model.LinearRegression()
y_preds=LR.predict_proba(X_test)
print(y_preds)

output:

[0.87812372 0.77490434 0.30319547 0.84999743]
MLKing
  • `LinearRegression` returns numeric predictions, and certainly *not* probabilities; the latter are returned by *logistic* regression models. – desertnaut Jan 31 '19 at 00:16
  • scikit-learn's `LinearRegression` does **not** include a `predict_proba` method ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)), and it would indeed be strange if it did. Did you actually run the code snippets you show here? – desertnaut Jan 31 '19 at 23:59
  • Friendly advice: keeping wrong and invalid answers there just because they happened to get some upvotes is neither a good idea nor how SO works. I kindly suggest you delete this one (in the long run, it will be better for your reputation, too). – desertnaut May 22 '20 at 12:19

This resolved the same problem for me; use .round() on the predictions:

accuracy_score(y_true, y_pred.round(), normalize=False)
Kiran

I was facing the same issue: the dtypes of y_test and y_pred were different. Make sure that the dtypes are the same for both.


The error is due to a difference in the datatypes of y_pred and y_true: y_true might be a DataFrame while y_pred is a list. If you convert both to arrays, the issue will be resolved.

Sreenath Nukala

accuracy_score is a classification metric, you cannot use it for a regression problem.

Use it this way:

accuracy_score(y_true, np.round(abs(y_pred)), normalize=False) 
Amir Alizadeh
  • You started correctly, and you should have stopped there; attempting to manipulate the results as you suggest is invalid, and it actually contradicts your first (correct) statement. Notice that this was exactly the suggestion in the initial, accepted (and wrong) answer, which is now deleted. – desertnaut Oct 18 '21 at 20:55