
I'm trying to limit the execution time of an analysis while keeping what it has already computed. In my case I'm running xgb.cv (from the xgboost R package) and I want to keep all the iterations completed before the analysis reaches 10 seconds (or "n" seconds/minutes/hours).

I've tried the approach mentioned in this thread, but it stops after reaching 10 seconds without keeping the iterations done up to that point.

Here is my code:

require(xgboost)
require(R.utils)

data(iris)
train.model <- model.matrix(Sepal.Length~., iris)

dtrain <- xgb.DMatrix(data=train.model, label=iris$Sepal.Length)

# custom evaluation metric: RMSE on the log scale
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(sum((log(preds) - log(labels))^2) / length(labels))
  return(list(metric = "error", value = err))
}

xgb_grid = list(eta = 0.05, max_depth = 5, subsample = 0.7, gamma = 0.3,
  min_child_weight = 1)

# run xgb.cv for at most 10 seconds; catch the timeout condition if it is thrown
fit_boost <- tryCatch(
  expr = {
    evalWithTimeout({
      xgb.cv(data        = dtrain,
             nrounds     = 10000,
             objective   = "reg:linear",
             eval_metric = evalerror,
             early_stopping_rounds = 300,
             print_every_n = 100,
             params      = xgb_grid,
             colsample_bytree = 0.7,
             nfold       = 5,
             prediction  = TRUE,
             maximize    = FALSE)
    },
    timeout = 10)
  },
  TimeoutException = function(ex) cat("Timeout. Skipping.\n"))

and the output is:

#Error in dim.xgb.DMatrix(x) : reached CPU time limit

Thank you!

  • Can you parallelize your task? If so learn about parallel::parLapply – Andre Elrico Sep 26 '17 at 14:53
  • @Andre. I'm familiar with parallelization but that's not what I want. Thank you. – patL Sep 26 '17 at 14:59
  • What about using a while-loop in some way, where you record the time with Sys.time() after each iteration and stop when the difference reaches 10 seconds? – KenHBS Sep 26 '17 at 15:15
  • @Ken. Thank you for your comment. The problem is that I don't know how to keep the iterations within `xgb.cv` (or `xgb.train`) before it times out. – patL Sep 27 '17 at 07:23
  • @patL I updated my answer. Might be a little closer to what you and future readers are trying to do. Cheers! – data princess Oct 16 '17 at 22:04
  • @dataprincess Thank you again for taking the time. The thing is that I want to get all the iterations performed before it times out. ;) – patL Oct 17 '17 at 08:59
  • Of course, of course. Well, one step at a time, eh? – data princess Oct 17 '17 at 13:38
  • @dataprincess Sure and thank you for your help – patL Oct 17 '17 at 13:53
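
For future readers, here is a rough sketch of the chunked / warm-start idea from the comments above: train in small chunks with xgb.train() (rather than xgb.cv(), which has no warm-start argument), continue each chunk from the previous booster via the xgb_model argument, and check Sys.time() between chunks. The 10-second budget and the chunk size are illustrative assumptions, and it fits a single model rather than running cross-validation.

# minimal sketch, assuming dtrain and xgb_grid from the question above
time_budget <- 10   # seconds (illustrative)
chunk_size  <- 50   # boosting rounds per chunk (illustrative)

booster <- NULL
start   <- Sys.time()

while (as.numeric(difftime(Sys.time(), start, units = "secs")) < time_budget) {
  booster <- xgb.train(params    = xgb_grid,
                       data      = dtrain,
                       nrounds   = chunk_size,
                       objective = "reg:linear",
                       xgb_model = booster)   # continue from the previous chunk
}

# 'booster' keeps every round completed before the deadline
# (the loop may overshoot the budget by at most one chunk)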

1 Answer


Edit - slightly closer to what you want:

Wrap the whole thing in R's capture.output() function. This will store all of the evaluation output as an R object (a character vector, one element per printed line). Again, I think you're looking for something more, but this at least keeps the log local and malleable. Syntax:

fit_boost <- capture.output(tryCatch(expr = {evalWithTimeout({...}, timeout = 10)}, TimeoutException = function(ex) cat("Timeout. Skipping.\n")))
> fit_boost
 [1] "[1]\ttrain-error:2.033160+0.006109\ttest-error:2.034180+0.017467 "  ...

Original answer:

You could also use a sink. Simply add this line before you start the cross-validation:

sink("evaluationLog.txt")
fit_boost <- tryCatch(
  expr = {
    evalWithTimeout({
      xgb.cv(data        = dtrain,
             nrounds     = 10000,
             objective   = "reg:linear",
             eval_metric = evalerror,
             early_stopping_rounds = 300,
             print_every_n = 100,
             params      = xgb_grid,
             colsample_bytree = 0.7,
             nfold       = 5,
             prediction  = TRUE,
             maximize    = FALSE)
    },
    timeout = 10)
  },
  TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
sink()

The sink() at the end would normally restore output to the console, but in this case it isn't reached because an error is thrown. Once you run this, you can open up evaluationLog.txt and voilà:

[1] train-error:2.033217+0.003705   test-error:2.032427+0.012808 
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 300 rounds.

[101]   train-error:0.045297+0.000396   test-error:0.060047+0.001849 
[201]   train-error:0.042085+0.000852   test-error:0.059798+0.002382 
[301]   train-error:0.041117+0.001032   test-error:0.059733+0.002701 
[401]   train-error:0.040340+0.001170   test-error:0.059481+0.002973 
[501]   train-error:0.039988+0.001145   test-error:0.059469+0.002929 
[601]   train-error:0.039698+0.001028   test-error:0.059416+0.003018 

This isn't perfect, of course. I imagine you want to perform some operations on these results, and plain text isn't exactly the best format. However, it's not a tall order to convert it into something more manageable (a rough parsing sketch follows below). I haven't yet found a way to save the actual xgb.cv$evaluation_log object before the timeout. That is a very good question.
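
For readers who want this history back inside R: a minimal sketch that reads evaluationLog.txt (the file created by the sink() call above) and rebuilds a small data frame resembling evaluation_log. The regular expressions and column names are assumptions based on the log format shown above (one train-error and one test-error mean plus standard deviation per printed iteration), so adjust them to your own metric names.

# keep only the per-iteration lines, i.e. those starting with "[<iteration>]"
log_lines  <- readLines("evaluationLog.txt")
iter_lines <- grep("^\\[[0-9]+\\]", log_lines, value = TRUE)

# pull out the iteration number and the four decimal values on each line
eval_log <- do.call(rbind, lapply(iter_lines, function(line) {
  iter <- as.integer(sub("^\\[([0-9]+)\\].*", "\\1", line))
  nums <- as.numeric(regmatches(line, gregexpr("[0-9]+\\.[0-9]+", line))[[1]])
  data.frame(iter             = iter,
             train_error_mean = nums[1],
             train_error_std  = nums[2],
             test_error_mean  = nums[3],
             test_error_std   = nums[4])
}))

head(eval_log)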

  • thank you so much. This is helpful for what I need, even though I cannot keep the results "inside" the `R` session. It would be great to get this `evaluationLog` into R so I could handle it. Voted up. – patL Oct 11 '17 at 08:26
  • No problem. For the record I misspelled it in my original post, there's an underscore. But yeah, it's a very interesting issue and I'll keep looking into it. That would be a very nice thing to be able to have saved in R. – data princess Oct 11 '17 at 14:13