1

I'm having trouble matching R's linear model confidence and prediction intervals for point estimates:

set.seed(3)
x <- rnorm(392, 20, 5)
y <- 2*x + 3 + rnorm(392, sd=3)

lm.fit <- lm(y~x)

(est <- as.vector(matrix(c(1,20),nrow=1,ncol=2) %*% lm.fit$coefficients))
(ci <- est + c(-1,1) * qt(.975, df=lm.fit$df.residual)*sqrt(var(lm.fit$residuals)/(lm.fit$df.residual)))
(pred <- predict(lm.fit,data.frame(x=(c(20))), interval="confidence", se.fit = TRUE))

sqrt(var(lm.fit$residuals)/(lm.fit$df.residual)) 
pred$se.fit

If you run the code above, you'll see that the point estimates are the same, but I get slightly different standard errors and therefore confidence intervals that are slightly off. How does R match the standard error?

I've seen in the documentation for predict.lm() (below) that R uses pred.var in the predict.lm function, but pred.var uses res.var, and doesn't say where res.var comes from. res.var also does not match the variance of the residuals:

var(lm.fit$residuals)

(http://127.0.0.1:25345/library/stats/html/predict.lm.html)

Thanks!

1 Answers1

2

You are confusing the prediction interval standard error (for a new observation) vs the confidence interval standard error (for the mean response).

Specifically, for prediction, the SE has an extra 1 in the expression of the standard error. I show the full calculation below.

See the code below:

set.seed(3)
x <- rnorm(392, 20, 5)
y <- 2*x + 3 + rnorm(392, sd=3)

lm.fit <- lm(y~x)

If this is a new observation you need a 1 in the scaling factor (see the first term

pred <- predict(lm.fit,data.frame(x=(c(20))), interval="prediction", se.fit = TRUE)
MSE <- sum(lm.fit$residuals^2)/(length(x)-2)
Scaling_factor = (1 + 1/length(x) + ((20 - mean(x))^2) / sum( (x- mean(x))^2     ))
est - qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
est + qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)

If you are predicting the mean response (which is what linear regression usually does) you don't need that term.

pred <- predict(lm.fit,data.frame(x=(c(20))), interval="confidence", se.fit = TRUE)
MSE <- sum(lm.fit$residuals^2)/(length(x)-2)
Scaling_factor = (1/length(x) + ((20 - mean(x))^2) / sum( (x- mean(x))^2     ))
est - qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)
est + qt(.975, length(x)-2)*sqrt(MSE*Scaling_factor)

Notice how they perfectly now match up. For a mathematical analysis, take a look at this great writeup:

Linear Regression for the mean response and for an individual response

Specifically, for the mean response:

enter image description here

However, for an individual observation:

enter image description here

That extra "1" is what gives you the difference. The basic idea is that when you are predicting an individual response there is additional randomness as your estimate of the MSE calculates the variability around the average response value, not an individual response value.

Intuitively, this makes a lot of sense. As your number of observations -> infinity the 1 keeps the standard error from going to 0 (as there's always some variability for any one person).

Hope that helps!

user1357015
  • 11,168
  • 22
  • 66
  • 111