47

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like

obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)

R2 <- rsq(df)
# 0.85
Maxim
  • 52,561
  • 27
  • 155
  • 209
Esme_
  • 1,360
  • 3
  • 18
  • 30
  • 6
    And the inevitable - "what about Adjusted R^2?" question is addressed here - http://stats.stackexchange.com/questions/48703/what-is-the-adjusted-r-squared-formula-in-lm-in-r-and-how-should-it-be-interpret – thelatemail Dec 01 '16 at 02:22
  • 1
    You may need to specify how this "modelled data" was created. The r.squared is reported by summary functions associated with regression functions. But only when such an estiamte is statsitically justified. – IRTFM Dec 01 '16 at 02:22
  • 1
    I have several different models, but it's not simply the output of a regression model such as lm, so the summary functions don't work. Basically I need to be able to compare between two vectors (Zheyuan Li's suggestion works perfectly for this) – Esme_ Dec 01 '16 at 03:06

6 Answers6

98

You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:

rsq <- function (x, y) cor(x, y) ^ 2

Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).


Let's do the statistics

Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.

lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)

lemma 1

lemma 2: beta = cov(x, y) / var(x)

lemma 2

lemma 3: R.square = cor(x, y) ^ 2

lemma 3


Warning

R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:

I agree with 42-'s comment:

The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.

R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.

Here is very extreme example:

preds <- 1:4/4
actual <- 1:4

The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??


In reply to wordsforthewise

Thanks for your comments 1, 2 and your answer of details.

You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.

As you demonstrated, this is just one way for computing R squared:

preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25

But there is another:

regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75

Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).

preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375

Final remark

I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.

Community
  • 1
  • 1
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • 2
    There are some assumptions that must be met for that to be true: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/ I want to use it for KNN but I'm not sure the assumptions hold. Will have to check if this is actually the same as conventionally-calculated R^2 – wordsforthewise Jan 18 '18 at 06:10
  • 4
    The answer from the correlation ^2 approach and the conventional 1 - SSE/SST is not the same. In the example I am using, I got `0.756388` from the corr^2 and `0.7531449` from the normal r^2 equation. Small difference but still, maybe it would model for hyperparameter selection, etc. – wordsforthewise Jan 19 '18 at 00:52
  • maybe it would *matter*, not model – wordsforthewise Sep 03 '18 at 23:55
  • 4
    In multiple linear regression, R-squared is the squared correlation between response vector and fitted values. Try `model <- lm(trees); cor(trees[[1]], model$fitted.values) ^ 2`. Compare this with `summary(model)$r.squared` – Zheyuan Li Nov 22 '18 at 17:05
  • Thanks for the reply in your answer, but I'm still confused why we can't go straight to the RSS calculation. I will have to meditate on the partition SS for a while. I also don't see any reason R2 can't be negative -- doesn't that effectively mean our predictions are worse than using the average value of the targets for predcitions? – wordsforthewise Mar 13 '19 at 16:40
  • I have relied on `R2=cor(y,yhat,method='pearson')^2` to calculate (pseudo-)R2 values for virtually all regression models (random forests, lasso, etc). I use R2 because it intuitive and can be compared across predictive models with few caveats. I then (re)discovered that correlation is location/scale invariant, via a very bad 1:1 plot vs a good R2 value. The R2 changed from an incorrect 0.42 (correlation^2) to 0.05 (SSR/SST). You highlight the location/scale issue in your "Warning" section of the answer. I now always use an `R2=function(y,yhat)` custom function to calculate (pseudo-)R2. – kakarot May 07 '19 at 14:29
  • 2
    This answer seems wrong. You state that: "R squared between two vectors is just the square of their correlation". Unless I'm mistaken this is only true in case of a least squares linear regression with estimated intercept. In general the formula for Rsquared is `1 - Residual Sum of Squares / Total Sum of Squares`. (And yes this can yield negative results, which is just an indication that the estimate is worse than using the average as prediction.) – Willem Feb 18 '20 at 13:49
  • Agree with previous comment by @Willem. I think you mistaken two definitions of R2 - coefficient of determination and Pearson's coefficient squared. When you compute it on the same data used to estimate the model, you should get a value close to Pearson's squared whatever formula you use. But if you use different test data for the model (what seems to be the case in OP's questions) coefficient of determination can be very different and even negative, That would just mean that either your model is bad or the training and test data are quite different (batch effect or confounding factors). – yuk Jan 06 '21 at 20:18
  • This doesn't work if one of the predictors is a time variable. – mj_whales May 19 '21 at 02:42
  • I am really confused with the answers. Two different calculations of R2 yields very different results. ```calculate_rsq = function(preds, actual){ rss <- sum((preds - actual) ^ 2) ## residual sum of squares tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares rsq <- 1 - rss/tss return (rsq) } calculate_rsq(preds, actual)``` # 0.25 ```cor(preds, actual) ^ 2``` # 0.75 ```summary(lm(preds~actual))$r.squared``` # 0.75 – ibilgen Jun 10 '21 at 16:12
22

Why not this:

rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • 3
    This works, but is slow for large matrices: `lm()` (and `summary.lm()`) do a lot of unnecessary calculations. See https://stackoverflow.com/questions/25416413/is-there-a-faster-lm-function for more. – Martin Smith Apr 29 '21 at 08:51
17

It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are

  • RMSE
  • Rsquared
  • mean absolute error (MAE)

and have to be accessed from the vector like this

library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]

However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.

caret also has an R2() method, although it's hard to find in the documentation.

The way to implement the normal coefficient of determination equation is:

preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss

Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.

wordsforthewise
  • 13,746
  • 5
  • 87
  • 117
  • 1
    This should be higher up. Also: r2_general <-function(preds,actual){ return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))} – JoeTheShmoe Sep 02 '20 at 23:56
14

You can also use the summary for linear models:

summary(lm(obs ~ mod, data=df))$r.squared 
Maria
  • 169
  • 1
  • 4
5

Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]

# 1. 'Actual' and 'Predicted' data
df <- data.frame(
  y_actual = c(1:5),
  y_predicted  = c(0.8, 2.4, 2, 3, 4.8))

# 2. R2 Score components

# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)

# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)

# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)

# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)

# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Andrii
  • 2,843
  • 27
  • 33
4

Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise, I just turned into a function for the sake of convenience if somebody uses it a lot like me.

r2_general <-function(preds,actual){ 
  return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
JoeTheShmoe
  • 433
  • 6
  • 13