
I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold, invoked within xgb.cv, then takes the remaining indices for each fold to be that fold's training set.

Question: Can I specify both the training and the validation indices through any of the xgboost interfaces?

My primary motivation is performing time-series cross-validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:

# assume I have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
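In R terms, the index lists I'm after would look like this (one element per fold):

# desired expanding-window scheme: fold k trains on X_1..X_(10k)
# and validates on the next 10 strips
train_idx <- lapply(1:9, function(k) 1:(10 * k))
valid_idx <- lapply(1:9, function(k) (10 * k + 1):(10 * k + 10))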

Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate: the remaining data greatly outnumber the training data, and they may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:

fold1:  train on X_1-X_10, validate on X_11-X_100 # huge error
...

I'm open to solutions from other packages as long as they are convenient (i.e. wouldn't require me to pry open the source code) and do not nullify the efficiency of the original xgboost implementation.

JP_smasher
  • Did you ever figure this out? I have a similar issue – B_Miner Mar 22 '16 at 23:11
  • @B_Miner nope, I had to implement it by invoking xgboost every time I trained a model for each validation segment. – JP_smasher Mar 24 '16 at 02:04
  • Maybe this link is helpful. https://stackoverflow.com/questions/38287223/how-to-use-custom-cross-validation-folds-with-xgboost – KST Oct 03 '17 at 10:40
  • the caret package can implement xgboost models and it has a createTimeSlices function which might help. See this [document](http://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series) for more info. – see24 Jul 23 '18 at 19:11

2 Answers


I think the bottom part of the question is the wrong way round; it should probably say:

force me to use the remaining examples as the training set

It also seems that the mentioned helper function xgb.cv.mknfold is not around anymore. Note that my version of xgboost is 0.71.2.

However, it does seem that this can be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:

xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL, 
          missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(), 
          obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL, 
          verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL, 
          maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, 
                                                        "label"))) || (!inherits(data, "xgb.DMatrix") && is.null(label))) 
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2) 
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    # new: custom training folds, if given, must match the validation folds one-to-one
    if (!is.null(folds_train) && (!is.list(folds_train) || length(folds_train) != length(folds)))
      stop("'folds_train' must be a list of index vectors of the same length as 'folds'")
    nfold <- length(folds)
  }
  else {
    if (nfold <= 1) 
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, 
                               label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, 
                                                       showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, 
                                                        "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, 
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
    # new: use the caller-supplied training indices for fold k when given,
    # otherwise fall back to the default of training on all out-of-fold rows
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle, watchlist = list(train = dtrain, 
                                                         test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 
                   1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 
                                          1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, 
                      obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, 
                    feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition) 
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks, 
              evaluation_log = evaluation_log, niter = end_iteration, 
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}

I have just added an optional argument folds_train = NULL and use it later on inside the function in this way (see above):

if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])

Then you can use the new version of the function, e.g. like below:

# save original version
orig <- xgboost::xgb.cv

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)

# now you can use (call) xgb.cv with the additional argument

# once you are done, or if you want to switch back to the original version
# (restarting R will also bring back the original version):
godmode:::assignAnywhere("xgb.cv", orig)

So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
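For example, an untested sketch of a full call (this assumes the godmode swap above has been done; the agaricus data bundled with xgboost and the fold sizes are just placeholders):

library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# expanding-window folds as in the question: fold k trains on the first
# 10*k rows and validates on the following 10 rows
n_folds <- 5
folds_train <- lapply(seq_len(n_folds), function(k) seq_len(10 * k))
folds <- lapply(seq_len(n_folds), function(k) (10 * k + 1):(10 * k + 10))

cv <- xgb.cv(params = list(objective = "binary:logistic"),
             data = dtrain, nrounds = 20,
             folds = folds, folds_train = folds_train,
             verbose = 0)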

Note that I have not had time to test this.

RolandASc
  • Can you give an example of how to use `xgb.cv_new`? I'm having difficulties specifying `folds_train` correctly. Thanks for your answer. – markus Jul 23 '18 at 08:01
  • `folds_train` should be specified in the same way as `folds`, i.e. a `list` where each element is a vector of indices. After assigning `xgb.cv_new` over `xgb.cv`, you should call `xgb.cv`, which will then be the modified version. I have edited my answer a little to be clearer. If it doesn't work, maybe you can show what kind of call you are doing? – RolandASc Jul 23 '18 at 09:46
  • Thanks for the feedback. Will give it a try and let you know how it goes. Regards – markus Jul 23 '18 at 10:24
  • Tried caret::createTimeSlices in the following way. Just assume we have 500 observations: `n<-500;train_folds<-createTimeSlices(seq_len(n),floor(n/3),floor(n/5),FALSE,floor(n/5))$train` If I pass `train_folds` to `xgb.cv_new` as the `folds_train` argument, the outcome of `xgb.cv_new(...)$folds` differs from `train_folds`. Also, I am not sure if the train set is validated _"only on [...] points after training"_, as the OP wrote. (Hope I was clear.) – markus Jul 24 '18 at 09:41
  • What you do for the `folds_train` argument looks OK to me. You still need to pass the `folds` argument at the same time, containing the desired indices for the validation set (I'm not sure from the last sentence of your comment whether you are doing this). – RolandASc Jul 24 '18 at 10:38
  • Thanks Roland. If you don't mind, I could later edit your answer and add an example. Best – markus Jul 24 '18 at 13:03
  • Sure, that would be great. Glad if it works for you now – RolandASc Jul 24 '18 at 13:30

According to the xgboost::xgb.cv documentation, you can pass custom test indices through the folds argument (which is NULL by default!). It needs to be a list in which every element is a vector of indices.

For example, if you wanted to do a time-series kind of splitting, you could do:

create_test_idx <- function(size) {
  half_size <- round(size / 2)
  step <- round(0.1 * half_size)
  # test windows start at the halfway point and advance in steps of ~5% of the data
  starts <- seq(from = half_size, to = size - step, by = step)
  # each test fold is one window of `step` consecutive rows, so the folds are disjoint
  return(lapply(starts, function(x) seq.int(from = x, length.out = step)))
}

my_custom_idx <- create_test_idx(nrow(my_train_data))
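With 100 rows, for instance, this yields ten disjoint five-row test windows marching through the second half of the data:

idx <- create_test_idx(100)
length(idx)  # 10 test folds
idx[[1]]     # rows 50 51 52 53 54
idx[[10]]    # rows 95 96 97 98 99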

Note that xgb.cv takes each fold's training set to be the union of the other folds (see the dtrain <- slice(dall, unlist(folds[-k])) line in the source quoted in the other answer), so each model here trains on the remaining windows, both before and after its own test window, and rows outside all windows are never used for training. In other words, this controls only the test indices, which is exactly the limitation raised in the question. With that caveat, you can then call, for example:

xgbcv <- xgboost::xgb.cv(
  params = params,
  data = mydata,
  nrounds = 10000,
  folds = my_custom_idx,
  showsd = TRUE,
  verbose = 0,
  early_stopping_rounds = 200,
  maximize = FALSE
)
eduardokapp
  • I get the error `'list' object cannot be coerced to type 'double'` after using `my_custom_idx`. Not sure if I am doing anything wrong. – Saurabh Apr 19 '23 at 21:32