35

How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not meaningful for regression. I would like to use mean squared error instead; is it possible to specify that in cross_val_score?

I tried the following two, but neither works:

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error') 

and

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring=metrics.mean_squared_error)

The first one generates a list of negative numbers while mean squared error should always be non-negative. The second one complains that:

mean_squared_error() takes exactly 2 arguments (3 given)
dorado
clwen
  • possible duplicate of [regression model evaluation using scikit-learn](http://stackoverflow.com/questions/23330827/regression-model-evaluation-using-scikit-learn) – Fred Foo Jun 10 '14 at 10:22

3 Answers

42

I don't have the reputation to comment, but I want to provide this link for you and/or any passersby where the negative output of the MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439

In addition (to make this a real answer): your first option is correct, in that not only is MSE the metric you want to use to compare models, but R^2 cannot be calculated depending (I think) on the type of cross-validation you are using.

If you choose MSE as a scorer, it outputs a list of errors which you can then take the mean of, like so:

# Doing linear regression with leave-one-out cross-validation

from sklearn import cross_validation, linear_model
import numpy as np

# Including this to remind you that it is necessary to use numpy arrays rather
# than lists, otherwise you will get an error
X_digits = np.array(x)
Y_digits = np.array(y)

loo = cross_validation.LeaveOneOut(len(Y_digits))

regr = linear_model.LinearRegression()

scores = cross_validation.cross_val_score(regr, X_digits, Y_digits,
                                          scoring='mean_squared_error', cv=loo)

# This will print the mean of the list of errors that were output and
# provide your metric for evaluation
print(scores.mean())
Sirrah
    DeprecationWarning: Scoring method mean_squared_error was renamed to neg_mean_squared_error in version 0.18 and will be removed in 0.20. sample_weight=sample_weight) – Reza Amya Dec 22 '17 at 19:52
  • in this example, since `cv=loo` (only 1 test sample per fold), the returned values inside `scores` are going to be the actual squared differences between the actual and predicted value for each single (test) sample, is that right? – seralouk Aug 29 '19 at 14:54
  • ValueError: 'mean_squared_error' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options. – keramat Sep 11 '20 at 07:01
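As the comments above note, this answer targets the pre-0.18 API. A modern equivalent would look roughly like the following (a sketch, assuming scikit-learn >= 0.18, with `load_diabetes` standing in for the answer's undefined `x`/`y`):

```python
# Same leave-one-out example with the post-0.18 API:
# cross_validation was replaced by model_selection, LeaveOneOut no
# longer takes a length, and the scorer was renamed
# 'neg_mean_squared_error'
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)  # example data

scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error',
                         cv=LeaveOneOut())

# Each entry is a negated squared error for one held-out sample;
# negate the mean to recover the usual non-negative MSE
print(-scores.mean())
```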
14

The first one is correct. It outputs the negative of the MSE, because the scorer interface always tries to maximize the score. Please help us by suggesting an improvement to the documentation.
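For instance (a minimal sketch using the question's SVR-on-diabetes setup and, as an assumption, the newer `'neg_mean_squared_error'` scorer name), negating the returned scores recovers the ordinary non-negative MSE:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# cross_val_score maximizes, so the scorer returns -MSE per fold
scores = cross_val_score(SVR(), X, y, cv=5,
                         scoring='neg_mean_squared_error')

print(scores)          # five negative numbers
print(-scores.mean())  # the usual non-negative mean squared error
```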

Abhinav Upadhyay
Andreas Mueller
  • By "it always tries to maximize the score" do you mean that it makes them negative so the best score (smallest MSE magnitude) is always the largest? – DataMan Oct 25 '17 at 19:40
  • yes. We also now changed it to "neg_mean_squared_error" to make it more clear. – Andreas Mueller Oct 27 '17 at 16:07
  • just to make things clear in my head, it seems... neg_mean_squared_error = - (mean_squared_error). What is the reason for having neg_mean_squared_error in the first place? – haneulkim Jan 15 '20 at 02:03
  • @AndreasMueller it seems like scoring='neg_mean_squared_error' and scoring='r2' return the same value for RidgeCV in the scikit-learn code. Do you know anything about it? https://stackoverflow.com/a/41174343/2943352 – rmutalik Feb 24 '20 at 14:49
0
Alternatively, you can build your own scorer with make_scorer. With greater_is_better=False the scorer returns negated values (since cross_val_score always maximizes), so negate the mean to recover the MSE:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error

scoring_metric = make_scorer(mean_squared_error,
                             greater_is_better=False)

score = cross_val_score(model,
                        X_test,
                        y_test,
                        cv=10,
                        scoring=scoring_metric)
mse = -score.mean()
mse
toyota Supra