3

As far as I can tell, both Stata and R have a "predict" function. I'm trying to replicate results that were performed in Stata using R, and the results involve calculating standard deviations of predicted values. Is there a functionality in R, maybe using its "predict" function, that will allow me to do this? I can't seem to replicate the results perfectly. In case it helps, the Stata code does the following:

reg Y X1 X2 if condition
predict resid, r
predict stdf, stdf

The definition of the stdf argument is:

stdf calculates the standard error of the forecast, which is the standard error of the point prediction for 1 observation. It is commonly referred to as the standard error of the future or forecast value. By construction, the standard errors produced by stdf are always larger than those produced by stdp; see Methods and formulas in [R] predict

And the R code I've been writing is:

fit <- lm(Y ~ X1 + X2, data=df)
new.df <- data.frame(...) # This is a new data frame with my new data period I want to predict in
predict(fit, new.df, se.fit = TRUE)

However, when I convert the standard errors to standard deviations, they don't match the Stata output.

Thanks in advance!

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
brian
  • 121
  • 2
  • 12
  • 1
    not quite sure what you mean. (1) can you provide a reproducible example? (2) are you possibly looking for the difference between a *confidence* and a *prediction* interval (the latter includes the effect of residual variation), in which case see the `interval` argument to `?predict.lm` ... ? – Ben Bolker Aug 31 '15 at 21:19
  • 2
    e.g. see http://stackoverflow.com/questions/9406139/r-programming-predict-prediction-vs-confidence/9406534#9406534 , http://stackoverflow.com/questions/15211384/any-simple-way-to-get-regression-prediction-intervals-in-r , etc. – Ben Bolker Aug 31 '15 at 21:22
  • Unfortunately, I can't provide the exact example because it is confidential. However, maybe I can generalize the question a little bit. Is there a way in R to calculate the residuals and standard deviations on predicted values, much like the "predict" function in Stata? Say you are given a control period (for regression) and a test period (for predicting). Can you get errors on your predicted y-values? Thank you! – brian Aug 31 '15 at 21:28
  • 2
    (1) Please read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for advice on how to create a reproducible example; some of the answers discuss ideas for MWEs when the data themselves cannot be shared. (2) For those of us who don't know what Stata is actually calculating, you need to provide us with a clear, concrete definition (I just copied that for you from the PDF) – Ben Bolker Aug 31 '15 at 21:47

2 Answers2

5

Looks to me that you need:

 predict(fit, new.df, se.fit = TRUE, interval="prediction")

"Standard errors" apply to the confidence limits around the estimate of the mean, while prediction errors might easily be described as "standard deviations" around predictions.

> dfrm <- data.frame(a=rnorm(30), drop=FALSE)
> dfrm$y <- 4+dfrm$a*5+0.5*rnorm(30)
> plot( dfrm$a, predict(mod) )
> plot( dfrm$a, predict(mod, newdata=dfrm) )
> points( rep(seq(-2,2,by=0.1),2),   # need two copies for upper and lower
          c(predict(mod, newdata=list(a=seq(-2,2,by=0.1)), 
                         interval="prediction")[, c("lwr","upr")]), 
          col="red")
> points(dfrm$a, dfrm$y, col="blue" )

enter image description here

IRTFM
  • 258,963
  • 21
  • 364
  • 487
2

Following @BondedDust's example: he shows how to get prediction intervals (+/- 1.96*std_dev). In principle you could recover the

 set.seed(1001)
 dfrm <- data.frame(a=rnorm(30), drop=FALSE)
 dfrm$y <- 4+dfrm$a*5+0.5*rnorm(30)

Fit model:

 mod <- lm(y ~ a, data=dfrm)

Predict:

 pframe <- data.frame(a=seq(-2,2,by=0.1))
 pred <- predict(mod,newdata=pframe,se.fit=TRUE)
 pframe$y <- pred$fit
 pframe$se <- pred$se.fit
 pframe$sd <- sqrt(pred$se.fit^2+sigma(mod)^2)

Results:

 head(pframe,3)
 ##      a         y        se        sd
 ## 1 -2.0 -5.877806 0.2498792 0.7319977
 ## 2 -1.9 -5.380531 0.2403656 0.7288049
 ## 3 -1.8 -4.883256 0.2309916 0.7257673

Check against the prediction interval:

 pred2 <- predict(mod,newdata=pframe,interval="predict")
 wid <- qt(0.975,df=pred$df)
 all.equal(unname(pred2[,"lwr"]),with(pframe,y-wid*sd))  ## TRUE
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453