35

I'm doing a simple linear model. I have

from sklearn import linear_model, cross_validation  # sklearn.cross_validation became sklearn.model_selection in later releases

fire = load_data()
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print(scores)

which yields

[  0.00000000e+00   0.00000000e+00  -8.27299054e+02  -5.80431382e+00
  -1.04444147e-01  -1.19367785e+00  -1.24843536e+00  -3.39950443e-01
   1.95018287e-02  -9.73940970e-02]

How is this possible? When I do the same thing with the built-in diabetes data, it works perfectly fine, but for my data it returns these seemingly absurd results. Have I done something wrong?

rhombidodecahedron
  • For this to happen with a `LinearRegression`, your model has to be so bad that predicting a simple average every time would be better. Usually this means that your model is overfitting. See my answer below for more details, or try setting `cv` to a smaller number. – mgoldwasser Mar 21 '17 at 20:50

4 Answers

41

There is no reason r^2 shouldn't be negative (despite the ^2 in its name). This is also stated in the docs. You can see r^2 as a comparison of your model fit (in the context of linear regression, e.g. a model of order 1, i.e. affine) to a model of order 0 (just fitting a constant), both minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can make the squared error of your predictions much larger than what you would get by simply predicting the mean of the test data, which results in a negative r^2 score.
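
As a concrete illustration of that train/test mean mismatch (a minimal sketch that is not part of the original answer; the toy numbers are made up):

import numpy as np
from sklearn.metrics import r2_score

# Toy targets: the test set's mean is far from the training set's mean.
y_train = np.array([1.0, 2.0, 3.0])             # training targets, mean = 2.0
y_test = np.array([10.0, 11.0, 12.0])           # test targets, mean = 11.0
y_pred = np.full_like(y_test, y_train.mean())   # constant model: always predict 2.0

# r2_score compares the predictions against predicting the *test* mean,
# so the mismatch alone produces a strongly negative score (about -121.5 here).
print(r2_score(y_test, y_pred))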

In the worst case, if your data do not explain your target at all, these scores can become strongly negative. Try:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100)  # y has nothing to do with X whatsoever
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

This should result in negative r^2 values.

In [23]: scores
Out[23]: 
array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
    -64.14367035])

The important question now is whether this is because linear models simply do not find anything in your data, or because of something that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this with sklearn.preprocessing.StandardScaler. In fact, you should create a new estimator by chaining a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline. Next you may want to try Ridge regression.
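
A minimal sketch of that suggestion (not from the original answer; it uses make_pipeline, which builds the same kind of Pipeline, and reuses the asker's fire data, whose loader is not shown here):

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

fire = load_data()  # the asker's own data loader, not defined in this post

# Scale every column to mean 0 and variance 1, then fit the linear model.
scaled_lr = make_pipeline(StandardScaler(), LinearRegression())
print(cross_val_score(scaled_lr, fire.data, fire.target, cv=10, scoring='r2'))

# If the scores stay poor, a regularized model such as Ridge is the next thing to try.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
print(cross_val_score(scaled_ridge, fire.data, fire.target, cv=10, scoring='r2'))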

eickenberg
  • Thanks for your help. I know that R^2 can be negative, but I thought it was supposed to be bounded to the interval [-1, 1]. Is that not the case? – rhombidodecahedron Apr 12 '14 at 23:41
  • R^2 is bounded above by 1.0, but it is not bounded below. *Correlation* is always bounded between -1 and 1. – eickenberg Apr 13 '14 at 18:10
  • Just because `R^2` can be negative does not mean we should expect it to be. Please see my answer below for reasons why `R^2` can be negative and how to fix them. – mgoldwasser Mar 21 '17 at 20:47
18

Just because R^2 can be negative does not mean it should be.

Possibility 1: a bug in your code.

A common bug that you should double-check is that you are passing in your parameters correctly:

r2_score(y_true, y_pred) # Correct!
r2_score(y_pred, y_true) # Incorrect!!!!

Possibility 2: small datasets

If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_validation.cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example) then you might build models on each fold that are not predictive for the other folds.
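
One way to rule that out (a sketch using the current sklearn.model_selection API rather than the older cross_validation module, and reusing the asker's fire data, whose loader is not shown here) is to pass an explicitly shuffled KFold as cv:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

fire = load_data()  # the asker's own data loader, not defined in this post
regr = LinearRegression()

# An unshuffled KFold splits the data in its stored order; if the rows are
# sorted (e.g. by date), each fold can look very different from the rest.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(regr, fire.data, fire.target, cv=cv, scoring='r2')
print(scores)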

Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross-validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should have a shape where

m > n^2

and when you are using cross-validation with f folds, you should aim for

m/f > n^2
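
As a quick sanity check of that rule of thumb (a sketch; the dataset dimensions below are made up for illustration):

# Hypothetical dimensions: m samples, n features, f cross-validation folds.
m, n, f = 500, 30, 10

print(m > n ** 2)      # False: 500 < 900, so even the full dataset is on the small side
print(m / f > n ** 2)  # False: 50 < 900, so with 10 folds the rule is violated badly
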
mgoldwasser
  • Good point to look for bugs. Negative R^2 is definitely worth investigating! However, even if you do everything right, R^2 can still be negative by pure stochasticity. As a matter of fact, the null distribution of predictive R^2 over Gaussian noise (i.e. data not predictable by the predictor) using a linear predictor is negative. (The estimated mean will be wrong, i.e. not 0, and the slope will also almost surely not be 0.) – eickenberg Mar 21 '17 at 22:36
  • @eickenberg True, but I believe in most cases it will be slightly negative. The reason I actually found this question was because I was getting an `R^2` of approximately `-0.99`, and it turned out that I had simply flipped `y_true` and `y_pred` in `r2_score`. I imagine a lot of users will have similarly silly bugs. – mgoldwasser Mar 22 '17 at 14:23
  • Yes, interesting observation! Indeed, if the predictions have less variance than the target (which is generally the case if e.g. additive noise is involved), this will make R^2 arbitrarily lower. Good to have this written here; it can save many people time spent on this type of bug. – eickenberg Mar 22 '17 at 16:09
  • Had a strongly negative R^2. It was the reversed-arguments issue posted here, and after fixing it the R^2 made sense. I had been plotting graphs that showed apparent relationships, so I was scratching my head. Thank you! – Mark Andersen Sep 04 '20 at 17:28
13

R² = 1 - RSS / TSS, where RSS is the residual sum of squares ∑(y - f(x))² and TSS is the total sum of squares ∑(y - mean(y))². Now for R² ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:

>>> import numpy as np
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581
Fred Foo
  • Exactly, the model just has to be "wrong enough", which is not difficult if you choose something that doesn't correspond at all. – eickenberg Apr 14 '14 at 09:35
2

If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. It's a simple check, but it will save you some headache.
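
For example (a minimal sketch assuming a pandas DataFrame with a hypothetical "id" column; the values are made up):

import pandas as pd

# Hypothetical table with an identifier column that carries no predictive signal.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "feature": [1.0, 2.0, 3.0, 4.0],
    "target": [1.1, 1.9, 3.2, 3.9],
})

# Drop the identifier (and the target) before building the feature matrix.
X = df.drop(columns=["id", "target"])
y = df["target"]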

Alexus Wong