
Almost all of the machine learning packages / functions in R allow you to obtain cross-validation performance metrics while training a model.

From what I can tell, the only way to do cross-validation with xgboost is to set up an xgb.cv call like this:

library(xgboost)

# param, dtrain, and watchlist are defined earlier in the script
clf <- xgb.cv(      params              = param, 
                    data                = dtrain, 
                    nrounds             = 1000,
                    verbose             = 1,
                    watchlist           = watchlist,
                    maximize            = FALSE,
                    nfold               = 2,
                    nthread             = 2,
                    prediction          = T
)

But even with prediction = T, you only get predictions for the training data. I don't see a way to use the resulting object (clf in this example) in a predict() call on new data.

Is my understanding accurate and is there any work-around?

  • A comment on the downvote would be appreciated so that I can make the post better. –  May 01 '16 at 23:49
  • Not the downvoter but wouldn't the answer just be `xgb.save(bst, "xgboost.model")` where bst is the result from `xgb.train()` and then load and predict with a new dataset? Saving the cross-validation results doesn't seem that useful to my understanding of your goals. – IRTFM May 16 '16 at 01:43
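(A minimal sketch of the train-then-save workflow described in the comment above, not part of the original thread; param, dtrain, and new_matrix are assumed to be defined elsewhere.)

library(xgboost)

# fit once on the full training data
bst <- xgb.train(params = param, data = dtrain, nrounds = 1000)

xgb.save(bst, "xgboost.model")      # persist the fitted booster to disk
bst2 <- xgb.load("xgboost.model")   # reload it later

# new_matrix: a matrix or xgb.DMatrix with the same features as dtrain
preds <- predict(bst2, new_matrix)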

1 Answer


I believe your understanding is accurate, and that there is no setting to save the models from cross validation.

For more control over cross-validation, you can train xgboost models with caret (see more details on the trainControl function here: http://topepo.github.io/caret/training.html).

Yet unless I'm mistaken, caret also lacks an option to save each CV model for later prediction (although you can manually specify the metrics you wish to evaluate them on). Depending on your reason for wanting to use the CV models to predict on new data, you could either 1) retrieve the row indices of each CV fold from the final model (the $control$index list within the object produced by caret's train function) and retrain that particular model (without cross-validation, but with the same seed) on just that subset of the data, as sketched after the console output below:

> library(MASS) # For the Boston dataset
> library(caret)
> ctrl <- trainControl(method = "cv", number = 3, savePred=T)
> mod <- train(medv~., data = Boston, method = "xgbLinear", trControl = ctrl)
> str(mod$control$index)

List of 3
 $ Fold1: int [1:336] 2 3 4 6 8 9 13 14 17 19 ...
 $ Fold2: int [1:338] 1 2 4 5 6 7 9 10 11 12 ...
 $ Fold3: int [1:338] 1 3 5 7 8 10 11 12 14 15 ...
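Following that idea, a rough sketch (mine, not from the original answer) of refitting just the Fold1 model with the tuning values train() selected, and then predicting on the rows that fold never saw, might look like this:

fold1_idx <- mod$control$index$Fold1   # rows caret used to train fold 1

set.seed(1)  # reuse whatever seed you set before the original train() call
refit <- train(medv ~ ., data = Boston[fold1_idx, ],
               method = "xgbLinear",
               # method = "none" fits one model with no resampling,
               # so it needs a one-row tuneGrid -- taken here from the tuned model
               trControl = trainControl(method = "none"),
               tuneGrid = mod$bestTune)

# predict on the rows fold 1 never saw (or on any new data frame)
preds <- predict(refit, newdata = Boston[-fold1_idx, ])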

or 2) manually cross-validate with lapply or a for loop, saving all the models you create; the createFolds family of functions in caret is a useful tool for choosing the cross-validation folds. A rough sketch of this approach follows below.
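(This sketch is my own illustration of option 2, not part of the original answer; the parameter values eta, nrounds, and objective are placeholders, not recommendations.)

library(MASS)     # Boston dataset
library(caret)    # createFolds()
library(xgboost)

x <- as.matrix(Boston[, setdiff(names(Boston), "medv")])
y <- Boston$medv

set.seed(1)
folds <- createFolds(y, k = 3)    # list of held-out row indices, one element per fold

cv_models <- lapply(folds, function(test_idx) {
  dtrain <- xgb.DMatrix(x[-test_idx, ], label = y[-test_idx])
  xgb.train(params = list(objective = "reg:linear", eta = 0.1),
            data = dtrain, nrounds = 100, verbose = 0)
})

# each element is an ordinary xgb.Booster, so it works in predict() on new data
preds_fold1 <- predict(cv_models[[1]], x[folds[[1]], ])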

Michael Veale
  • Thanks for your answer. I am just trying to get *a* model that I can use for prediction, not one specific to any particular CV fold. I just don't want to double the time I spend training models. Actually with `caret` almost all of the model types provide this functionality. The reason I didn't think that caret was a solution was that the last time I checked xgb had only been implemented with extremely limited functionality. It seems they've improved the tuning capabilities a lot in the past year. Let me check on this and I may mark this as the answer depending on what I find. – Hack-R Apr 09 '16 at 23:30