I am trying to understand how caret's control settings work. I am running some experiments that use cross-validation configured through caret's control functions, e.g.

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

or

control <- rfeControl(functions=rfFuncs, method="repeatedcv", number=5, repeats = 5)

My question is: if I set a seed before I run the experiments, i.e.

set.seed(5432)
control <- trainControl(...)
results <- train(..., trControl = control)
...

does that guarantee that each fold contains exactly the same samples every time I run an experiment? For example, say I have samples with id = {1:100} and, with caret's 10-fold cross-validation, my folds are: fold1 = {1:10}, fold2 = {11:20}, ..., fold10 = {91:100}. If I rerun the experiment using the same seed, will my folds be exactly the same as in the previous run?

I know that setting a seed helps with reproducibility, but I would like confirmation that this is exactly what happens.

Many thanks,

user1480478

1 Answer

There are two ways of setting the seed for reproducibility:

  1. calling set.seed immediately before the train function.
  2. setting the seeds argument inside trainControl (or rfeControl).

For more information on option 2, check the help page, and see also this SO question.
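As a rough illustration of option 2, here is a minimal sketch of building a `seeds` list for the 10-fold, 10-times-repeated scheme from the question. It assumes the layout described on the `trainControl` help page (one integer vector per resample, plus a single integer for the final model) and assumes a model with no tuning grid, so each vector has length 1. The names `n_resamples` and `n_models` are illustrative only. As far as I understand the help page, these seeds are applied at each resampling iteration while models are fitted (mainly useful for parallel runs); they are not what draws the folds themselves.

```r
library(caret)

# Resampling scheme from the question: 10-fold CV repeated 10 times
n_resamples <- 10 * 10
# Assumption: a single candidate model per resample (no tuning grid), e.g. method = "lm"
n_models <- 1

set.seed(5432)
# One integer vector per resample, plus one integer for the final model fit
seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds[[i]] <- sample.int(10000, n_models)
}
seeds[[n_resamples + 1]] <- sample.int(10000, 1)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           seeds = seeds)
```

If the model has a tuning grid, `n_models` would need to match the number of candidate parameter combinations.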

More detailed information is available on the model training page of the caret website, in the section "Notes on Reproducibility".
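To check the folds directly, one possible sketch is to reseed before every call to `train` and compare the resampling indices stored in the fitted object (`fit$control$index`). The `iris` data, the `"lm"` method, and the helper name `fit_and_get_folds` are stand-ins chosen for illustration; they are not from the question.

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Hypothetical helper: reseed, fit, and return the training-row indices
# that train() generated for each resample
fit_and_get_folds <- function(seed) {
  set.seed(seed)
  fit <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = ctrl)
  fit$control$index
}

folds_run1 <- fit_and_get_folds(5432)
folds_run2 <- fit_and_get_folds(5432)

# Compare the row indices of every fold across the two runs
identical(folds_run1, folds_run2)
```

The important detail is that `set.seed` sits inside the helper, so it runs again right before each fit.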

phiver
  • I have checked the docs you mentioned, but they don't say explicitly that the sampling folds will be exactly the same. I tried running the code twice using the same seed number (tested with both options you mentioned), and on inspection the rowIndex values differ between the two runs. – user1480478 Aug 02 '16 at 14:18
  • If you set the seed to be the same number just before running `train`, you will get the same resamples. – topepo Aug 02 '16 at 15:40
  • @topepo You are absolutely correct! What I have learnt is that you need to actually run set.seed() every single time you rerun the experiment. I called set.seed at the beginning of the code and thought the seed would stay set to that number until I changed it. I was completely wrong there. Thanks again peep! :D – user1480478 Aug 02 '16 at 17:32
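The point in the last comment can be seen with base R alone. The sketch below uses `sample` as a stand-in for whatever random draws caret makes internally: every draw advances the generator's state, so a single `set.seed` at the top of a script only pins down the first draw that follows it.

```r
set.seed(5432)
first_draw  <- sample(100)  # consumes random numbers and advances the RNG state
second_draw <- sample(100)  # starts from the advanced state, so the permutation differs

set.seed(5432)              # reset the state to the same starting point
third_draw <- sample(100)

identical(first_draw, third_draw)   # same seed, same position in the stream
identical(first_draw, second_draw)  # same seed, but a later position in the stream
```

The same reasoning applies to `train`: reseeding immediately before each call puts the generator back at the same starting point, which is why the resamples then match.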