3

This may be a silly question but I just can't find a package to do that...I know I can write some codes to get what I want but it would be nice to have a function to do it automatically!

So basically I want to do a k-fold cross-validation for a glm model. I want to automatically get the predictions of each validation set and the actual value too. So if I am doing a 10-fold CV, I want a function to return the 10 validation sets with the actual responses and predictions all together.

Thank you in advance!

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
zhifff
  • 199
  • 1
  • 4
  • 15

2 Answers2

14

As stated in the comments, caret makes cross-validation very easy. Just use the "glm" method, like so:

> library(caret)
> set.seed(2)
> dat <- data.frame(label=round(runif(100,0,5)),v1=rnorm(100),v2=rnorm(100))
> tc <- trainControl("cv",10,savePred=T)
> (fit <- train(label~.,data=dat,method="glm",trControl=tc,family=poisson(link = "log")))
100 samples
  2 predictors

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 90, 91, 91, 90, 90, 89, ... 

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  1.53  0.146     0.131    0.235      


> fit$finalModel$family

Family: poisson 
Link function: log 

> head(fit$pred)
      pred obs rowIndex .parameter Resample
1 2.684367   1        1       none   Fold01
2 2.165246   1       18       none   Fold01
3 2.716165   3       35       none   Fold01
4 2.514789   3       36       none   Fold01
5 2.249137   5       47       none   Fold01
6 2.328514   2       48       none   Fold01
David
  • 9,284
  • 3
  • 41
  • 40
  • 1
    Missed that you want to look at the results for each fold, just add the `savePred=T` option to your `trainControl` option and they'll be in the `pred` field of your model object. – David Aug 28 '13 at 21:54
  • one question: I want to fit a rate model (poisson model with offset) The original model I had using glm and the model I use your codes (train function then fit$finalModel) are very very different. any idea why? I am not sure whether "train" function can handle all kinds of glm – zhifff Aug 29 '13 at 18:59
  • I assumed you wanted logistic regression (thus my example), I edited my post to show the example with a poisson link. – David Aug 29 '13 at 23:45
  • yes thats exactly what I did too...but the model turn out to be very different from the model if I just fit the whole dataset using glm. maybe be the reason is because of the offset. – zhifff Aug 30 '13 at 13:48
  • my model would be something like y~offset(log(x1))+x2+x3 with poisson link. so it's actually a rate model with the response to be y/x1. – zhifff Aug 30 '13 at 13:49
  • caret is just a wrapper so you can use whatever paramters to glm that you would like. When you look at the help documentation and see `train(x, ...)` that `...` is saying that anything extra you pass into your train call will go as a parater into the method function you are calling. Thats why I was able to simply add `family=poisson(link = "log")` into the train call above. Long story short, add the exact parameters from your original model into your train call if you'd like and if you're still confused after that post a new question because it could be related to caret's autotuning. – David Aug 30 '13 at 14:13
2

I would suggest investigating cv.glm from package boot, because you are working with a glm model. Another option would be package cvTools. I've found it more useful to write up my own function for CV, though. It sounds like you want a CV function that ends halfway, and most CV functions I've seen will average the prediction error over all the validation sets and return just the average (which, of course, is the definition of cross validation).

Daniel Watkins
  • 1,631
  • 14
  • 16
  • any idea how to save the prediction using cv.glm? It;s kind of troublesome to write it my own especially I wanted to do a leave one out CV... – zhifff Aug 29 '13 at 19:00
  • I wrote my own code . I have a sample with 4000 records. doing a leave one out taking a very long time...anyone knows a more efficient way..? – zhifff Aug 29 '13 at 20:32