24

I know there is a small difference between $sigma and the concept of root mean squared error. So, i am wondering what is the easiest way to obtain RMSE out of lm function in R?

res<-lm(randomData$price ~randomData$carat+
                     randomData$cut+randomData$color+
                     randomData$clarity+randomData$depth+
                     randomData$table+randomData$x+
                     randomData$y+randomData$z)

length(coefficients(res))

contains 24 coefficient, and I cannot make my model manually anymore. So, how can I evaluate the RMSE based on coefficients derived from lm?

jeramy townsley
  • 240
  • 3
  • 18
Jeff
  • 7,767
  • 28
  • 85
  • 138
  • 13
    strongly recommend `lm(price ~ carat + color + ..., data=randomData)`: easier to read, works with downstream methods such as `predict()`. – Ben Bolker Mar 30 '17 at 16:46

5 Answers5

36

Residual sum of squares:

RSS <- c(crossprod(res$residuals))

Mean squared error:

MSE <- RSS / length(res$residuals)

Root MSE:

RMSE <- sqrt(MSE)

Pearson estimated residual variance (as returned by summary.lm):

sig2 <- RSS / res$df.residual

Statistically, MSE is the maximum likelihood estimator of residual variance, but is biased (downward). The Pearson one is the restricted maximum likelihood estimator of residual variance, which is unbiased.


Remark

  • Given two vectors x and y, c(crossprod(x, y)) is equivalent to sum(x * y) but much faster. c(crossprod(x)) is likewise faster than sum(x ^ 2).
  • sum(x) / length(x) is also faster than mean(x).
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • The code as it is in the first line cannot be run. There is 3 ')' but only 2 '('. I have tried to run without the last ')' but didn't work. it gets the following error: "unused argument (crossprod(fit2$residuals))" – André Costa Aug 25 '18 at 18:09
  • 2
    @AndréCosta Just see that you have another issue. You must have defined your own `c` function in your R session (which is a bad idea), masking `base::c` for vector concatenation. Here is how I reproduce your error: `c <- function () NULL` and then `c(crossprod(1:10))` – Zheyuan Li Aug 25 '18 at 19:01
  • My c function was fine. Maybe something as change in some version of R. For me the way worked was crossprod(1:10)[1]. (btw, it wasn't me who upvote, hahaha) – André Costa Aug 25 '18 at 19:47
  • 1
    @AndréCosta There are other alternatives like `drop(crossprd(1:10))`. Ah, sorry for that. Maybe that person also voted on the other answer here :) – Zheyuan Li Aug 25 '18 at 19:52
  • You can also calculate MSE=mean(res$residuals^2) – Kenan Sep 03 '21 at 02:18
32

To get the RMSE in one line, with just functions from base, I would use:

sqrt(mean(res$residuals^2))
comshak
  • 623
  • 8
  • 8
5

I think the other answers might be incorrect. The MSE of regression is the SSE divided by (n - k - 1), where n is the number of data points and k is the number of model parameters.

Simply taking the mean of the residuals squared (as other answers have suggested) is the equivalent of dividing by n instead of (n - k - 1).

I would calculate RMSE by sqrt(sum(res$residuals^2) / res$df).

The quantity in the denominator res$df gives you the degrees of freedom, which is the same as (n - k - 1). Take a look at this for reference: https://www3.nd.edu/~rwilliam/stats2/l02.pdf

Arthur
  • 1,248
  • 8
  • 14
  • 1
    If you look, you'll find plenty of other sources defining RMSE as exactly what it sounds like, the root mean squared error. Square the errors, take the mean, take the square root. The [wikipedia page](https://en.wikipedia.org/wiki/Root-mean-square_deviation), for example, doesn't mention anything about degrees of freedom or model parameters. – Gregor Thomas Nov 07 '18 at 19:52
  • 1
    Wikipedia covers it here: https://en.wikipedia.org/wiki/Mean_squared_error#Regression. The MSE definition with the degrees of freedom is also in my stats textbook by Hayter, "Probability and Statistics for Engineers and Scientists." – Arthur Nov 07 '18 at 20:40
0

Just do

sigma(res) 

An you got it

  • `sigma()` returns the "the estimated standard deviation of the errors (“residual standard deviation”) for Gaussian models" (help(sigma), not the root mean squared error. – Richard Careaga Jan 21 '21 at 06:33
0

Checkout the rmse() function in the Metrics package

jans
  • 255
  • 1
  • 3
  • 8