
I am using the caret package in R for some supervised multivariate analysis. I am trying to add some functionality to my script that will allow for reproducible outcomes whenever the script is run.

I have this setup for using 2 classification models (each model is run separately, not as an ensemble):

library(caret)

load.data = ....
cleaned.data = cleaning(load.data)
mycontrol = trainControl(...)
train, test = createDataPartition(...)

model1 = train(...,
               data=train, ...,
               trControl=mycontrol,
               preProcess=c('center'))
model2 = train(...,
               data=train, ...,
               trControl=mycontrol,
               preProcess=c('pca'))

feature.importances = ...
summary(resamples(list(m1=model1,m2=model2)))
learing_curve_dat(...) #see link 1. below.
predict()
Evaluate(....) #see link 2. below

Where in this pipeline should I use set.seed(#) and what should # be in order to get reproducible outcomes each time the script is run - or do I just pick any value for # randomly?

Links:

1. 2.

jmuhlenkamp
edesz

1 Answer


You should read the Notes on Reproducibility section on the package web page.

The seed value itself doesn't matter, as long as it is the same on every run. I generate one with sample.int(100000, 1) and then hard-code it in the script. Depending on how you are fitting the model, you should at least set the seed just before each call to train (but please read the link above).
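As a rough illustration of the advice above, here is a minimal sketch of where the set.seed calls could go. The iris data, the seed value 1234, the "knn" method, and the 5-fold cross-validation are all assumptions chosen for the example; they are not taken from the question's pipeline.

```r
library(caret)

data(iris)

# Any fixed integer works; 1234 is an arbitrary choice for this sketch.
seed <- 1234

# Seed before the split so the same rows land in the training set on every run.
set.seed(seed)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

ctrl <- trainControl(method = "cv", number = 5)

# Reset the same seed immediately before each train() call so that both
# models are resampled on the same folds.
set.seed(seed)
model1 <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("center"))

set.seed(seed)
model2 <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("pca"))

# The two models share resamples, so they can be compared directly.
summary(resamples(list(m1 = model1, m2 = model2)))
```

Resetting the seed before each train call, rather than once at the top of the script, keeps a model's resamples from depending on how many random numbers earlier steps consumed. When resampling runs in parallel, trainControl also accepts a seeds argument; the package notes cover that case.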

topepo