I'm using the R wrapper for XGBoost. The function `xgb.cv` has a `folds` parameter with the description:

list provides a possibility of using a list of pre-defined CV folds (each element must be a vector of fold's indices). If folds are supplied, the nfold and stratified parameters would be ignored.

So, do I just specify the indices for training the model and assume the rest will be for testing? For example, if my training data is something like

       Feature1 Feature2 Target
    1:        2       10     10
    2:        7        1      9
    3:        8        2      3
    4:        8       10      7
    5:        8        2      9
    6:        3        7      3

and I want to cross-validate using (train, test) indices ((1,2,3), (4,5,6)) and ((4,5,6), (1,2,3)), do I set `folds=list(c(1,2,3), c(4,5,6))`?

Ben
  • One of `caret::createFolds` or `caret::createDataPartition` would do the hard work for you. Your example is probably correct. – m-dz Jul 10 '16 at 00:36

3 Answers

Through some trial and error, I figured out that xgboost uses the passed indices as the indices of the test folds. I confirmed this by noticing that the current devel version of xgboost states it explicitly in the documentation.
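To make this concrete for the six-row example in the question, here is a small base-R sketch (it does not call xgboost). It assumes, from my reading of the package source and worth checking against your installed version, that the training rows for fold k are the rows listed in the other folds. When the folds partition the data, that is simply every row not in fold k. The name `splits` is mine, purely for illustration.

```r
# Folds as they would be passed to xgb.cv(folds = ...):
# each element lists the TEST row indices of one fold.
folds <- list(c(1, 2, 3), c(4, 5, 6))

# Assumption: the training rows for fold k are the rows in the other folds.
splits <- lapply(seq_along(folds), function(k) {
  list(test = folds[[k]], train = unlist(folds[-k]))
})

# Inspect the resulting (test, train) pairs
str(splits)
```

Under that assumption, `folds=list(c(1,2,3), c(4,5,6))` yields the two (train, test) splits asked for in the question, just listed in the opposite order.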

Ben
  • custom folds are giving results very different than regular CV. To describe what I am doing: the `xgb.DMatrix` has 2000 rows. `folds` should have test indices. So, `folds=list(1000:2000, 1500:2000, 1750:2000)`. It is doing in-the-bag predictions... – xm1 Jul 11 '19 at 00:16
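One possible explanation for results like these (my assumption, not something confirmed in this thread) is that the folds overlap. If the training rows for a fold are taken from the other folds, any index that appears in more than one fold ends up in both the training set and the test set. A quick sanity check in base R, where `check_folds` is a hypothetical helper name:

```r
# Hypothetical helper: report whether folds are disjoint and cover rows 1..n.
check_folds <- function(folds, n) {
  idx <- unlist(folds)
  list(
    disjoint = anyDuplicated(idx) == 0,    # no row index appears twice
    complete = setequal(idx, seq_len(n))   # every row 1..n appears in some fold
  )
}

# The overlapping folds from the comment (2000 rows)
bad <- check_folds(list(1000:2000, 1500:2000, 1750:2000), n = 2000)

# A non-overlapping alternative: three contiguous blocks of rows
blocks <- split(seq_len(2000), cut(seq_len(2000), 3, labels = FALSE))
good <- check_folds(blocks, n = 2000)
```

Folds that pass both checks behave like an ordinary k-fold split, so their results should be comparable with regular CV.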

Here is an example for both generating the folds and using them.

Assume in our dataframe we have a column of ids, such that we want to put all rows with a given id value in a fold.

The code below

  • finds the unique ids
  • preallocates a list for the folds
  • iterates over ids, creating lists of row indices that match

    # One fold per unique id: each fold holds the row indices for that id
    fold.ids <- unique(df$id)
    custom.folds <- vector("list", length(fold.ids))
    i <- 1
    for (id in fold.ids) {
      custom.folds[[i]] <- which(df$id %in% id)
      i <- i + 1
    }

Here is an example using the above fold list in xgb.cv

    res <- xgb.cv(params = param, data = dtrain, nrounds = nround, folds = custom.folds, prediction = TRUE)

Reasonable values for the other `xgb.cv` parameters can be found in the documentation.
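For what it's worth, the same grouped folds can probably be built without an explicit loop. The sketch below uses a made-up toy data frame (the `id` values and the `x` column are assumptions for illustration only) and base R's `split()`. The folds come out ordered by sorted id rather than by first appearance, which should not matter for cross-validation.

```r
# Toy data frame; the `id` column defines the groups (values are made up).
df <- data.frame(id = c("a", "b", "a", "c", "b", "c"), x = 1:6)

# split() groups the row numbers by id, giving one fold per unique id.
# unname() drops the id names so the result is a plain list, like the loop output.
custom.folds <- unname(split(seq_len(nrow(df)), df$id))

str(custom.folds)
```

Each element can then be passed to `xgb.cv` through `folds=custom.folds` exactly as in the call above.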

Andrew Olney

This worked best for me:

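    # createFolds() returns held-out (test) indices by default, which is what `folds` expects
    custom.folds <- caret::createFolds(data$Label, k = 10, list = TRUE)

    xgbcv <- xgb.cv(
      params = params
      ,data = df
      ,nrounds = nrounds  # required by xgb.cv; set to your number of boosting rounds
      ,maximize = FALSE
      ,prediction = TRUE
      ,metrics = "logloss"
      ,folds = custom.folds
    )
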
Philipp G.