I've got a script that looks like this:
#This is the master script. It runs all other scripts.
rm(list=ls())
#Run data cleaning script
source("datacleaning.R")
set.seed(413) #Seed pre-selected as lead author's wife's birthday (April 13th)
reps <- 128
#Make imputed datasets
source("makeimps.R")
#Model selection step 1.
source("model_selection.1.R")
load("AIC_results.1")
AIC_results
#best model removed the year interaction
#Model selection step 2. removed year interaction
source("model_selection.2.R")
load("AIC_results.2")
AIC_results
#all interactions pretty good. keeping this model
#Final selected model:
source("selectedmodel.R")
I send this master script to a supercomputing cluster; it takes about 17 hours of CPU time and 40 minutes of wall time on 32 cores (hence the lack of a reproducible example). But when I run the script, look at the results, then run it again and look at the results again, they are slightly different. Why? I set the seed! Does the seed get reset somehow? Do I need to set the seed inside each script file?
I need to increase the number of reps, because it's clear that I haven't converged sufficiently, but that's a separate issue. Why are my results not reproducing themselves, and how do I fix it?
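To illustrate, here is a toy version of the symptom (a made-up computation, not my real scripts; the backend is the doMC/plyr setup described in the EDIT below):

library(doMC)
library(plyr)
registerDoMC(cores = 4)

set.seed(413)
run1 <- laply(1:8, function(i) mean(rnorm(1e5)), .parallel = TRUE)

set.seed(413)
run2 <- laply(1:8, function(i) mean(rnorm(1e5)), .parallel = TRUE)

identical(run1, run2)  #typically FALSE: each forked worker re-seeds its own RNG,
                       #so the set.seed() in the master never reaches the workers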
Thanks in advance.
EDIT: I'm doing the parallelization through doMC and plyr. Some light googling based on comments below turns up the fact that one can't really set a "parallel seed" using these packages; I'd need to migrate my code to SNOW somehow. If anyone knows a solution with doMC and plyr, I'd be grateful to learn what it is.
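In case it clarifies what I mean by migrating: my rough (untested) understanding is that the SNOW route would look something like the sketch below, using snow/doSNOW and its RNGstream setup. Here make_one_imputation is a made-up stand-in for whatever makeimps.R does per replicate, and the seed value is just my usual one.

library(snow)
library(doSNOW)
library(plyr)

cl <- makeCluster(32, type = "SOCK")
registerDoSNOW(cl)  #foreach backend, so plyr's .parallel = TRUE still works
#Give each worker its own L'Ecuyer stream (needs the rlecuyer package installed)
clusterSetupRNG(cl, type = "RNGstream", seed = rep(413, 6))

reps <- 128
imps <- llply(1:reps, make_one_imputation, .parallel = TRUE)  #made-up per-replicate function

stopCluster(cl)

Even then I'm not sure the results would be bit-for-bit identical across runs if the task-to-worker scheduling changes, which is part of why I'd rather find a solution that stays with doMC and plyr.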