
I've got a script that looks like this:

#This is the master script.  It runs all other scripts.
rm(list=ls()) 

#Run data cleaning script
source("datacleaning.R")

set.seed(413) #Seed pre-selected as lead author's wife's birthday (April 13th)
reps <- 128

#Make imputed datasets
source("makeimps.R")

#Model selection step 1.  
source("model_selection.1.R")
load("AIC_results.1")
AIC_results

#Best model removed the year interaction

#Model selection step 2: removed the year interaction
source("model_selection.2.R")
load("AIC_results.2")
AIC_results

#All interactions pretty good; keeping this model

#Final selected model:
source("selectedmodel.R")

I send this master script to a supercomputing cluster; it takes about 17 hours of CPU time and 40 minutes of walltime on 32 cores (hence the lack of a reproducible example). But when I run the script, look at the results, then run it again and look at the results again, they are slightly different. Why? I set the seed! Does the seed get reset somehow? Do I need to set the seed inside each script file?

I need to increase the number of reps, because it's clear that I haven't converged sufficiently, but that's a separate issue. Why are my results not reproducing themselves, and how do I fix this?

Thanks in advance.

EDIT: I'm doing the parallelization through doMC and plyr. Some light googling based on the comments below turns up the fact that one can't really set a "parallel seed" using these packages; I'd need to migrate my code to snow somehow. If anyone knows a solution with doMC and plyr, I'd be grateful to learn what it is.
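For anyone who wants to reproduce the problem, here is a stripped-down sketch of the pattern that fails (my real scripts are too heavy to post; doMC is fork-based, so this assumes a Unix-alike):

require(doMC)
require(plyr)
registerDoMC(2)

set.seed(413)
r1 <- laply(1:4, function(i) runif(1), .parallel = TRUE)
set.seed(413)
r2 <- laply(1:4, function(i) runif(1), .parallel = TRUE)

#typically FALSE: each forked worker draws from its own unmanaged RNG state,
#so set.seed() in the master process does not control the parallel draws
identical(r1, r2)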

generic_user
    Look in those RData files that you seem to be saving. In any of them, is there a `.Random.seed` object? Try using the answers to [this question](http://stackoverflow.com/q/4831050/1465387). – sebastian-c Apr 10 '13 at 05:17
    If any of the files you are sourcing use the R multicore package, have a look at this: http://stats.stackexchange.com/a/3534 – hrbrmstr Apr 10 '13 at 05:21
  • @hrbrmstr: that's it. Got some digging to do. Thanks. – generic_user Apr 10 '13 at 05:29
  • You want to (at least) read the vignette of the `parallel` package that has been part of Base R for a while. – Dirk Eddelbuettel Apr 10 '13 at 11:36
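(For reference, the approach in the `parallel` vignette boils down to seeding reproducible L'Ecuyer-CMRG streams on the cluster; a minimal sketch, assuming a PSOCK cluster rather than doMC:)

require(parallel)
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 413) #one reproducible RNG stream per worker
unlist(parLapply(cl, 1:4, function(i) runif(1)))
stopCluster(cl)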

1 Answer


Look at the doRNG package, which was developed specifically for this kind of reproducible parallel computing. Set the seed inside the call to the loop via `.options.RNG` and you will be able to reproduce your results exactly:

require(doParallel)
require(doRNG)
cl <- makeCluster(4)
registerDoParallel(cl)

#.options.RNG seeds doRNG's reproducible parallel RNG streams, so the same
#seed gives the same draws on every run
unlist( foreach( i = 1:4 , .options.RNG = 413 ) %dorng% { runif(1) } )
#[1] 0.5251507 0.4326805 0.6409496 0.5523651

unlist( foreach( i = 1:4 , .options.RNG = 413 ) %dorng% { runif(1) } )
#[1] 0.5251507 0.4326805 0.6409496 0.5523651

stopCluster(cl)
Simon O'Hanlon
    But this approach won't work with `plyr`, will it? Is there a way to ask `plyr` to use '%dorng%' rather than '%dopar%'? – Steve Weston Apr 10 '13 at 19:15
  • OP here: ^ my question exactly. – generic_user Apr 10 '13 at 22:32
  • @SteveWeston @ACD I do not know `plyr` so well. It would be beneficial if you could post some code showing a trivial example of what you are doing with `plyr`, to see if it could be ported to use `%dorng%` instead. One thought that occurs to me: instead of using `plyr` to split the jobs across cores, you could split the job using `%dorng%` and then use `plyr` within each job to process the data on a single core (a sketch of this idea follows after these comments). Can you partition your data before it is used by `plyr`? – Simon O'Hanlon Apr 11 '13 at 08:27
  • These suggestions make a lot of sense to me. Many uses of `plyr` translate easily to `foreach`, and then `%dorng%` can be used as in this answer. It's not easy to guarantee reproducible results even in `snow`, which is one of the reasons for the old `snowFT` package. – Steve Weston Apr 11 '13 at 14:17
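Following up on the comment thread above, here is a minimal sketch of that hybrid pattern: `%dorng%` does the reproducible parallel split, and `plyr` runs serially inside each job (the `mtcars` chunking and `noisy.mean` summary are made-up placeholders, not the OP's code):

require(doParallel)
require(doRNG)
require(plyr)

cl <- makeCluster(4)
registerDoParallel(cl)

#partition the data into one chunk per worker before plyr ever sees it
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

#%dorng% gives each job a reproducible RNG stream seeded from 413;
#plyr then processes each chunk on a single core
res <- foreach(chunk = chunks, .packages = "plyr",
               .options.RNG = 413) %dorng% {
  ddply(chunk, "cyl", summarise, noisy.mean = mean(mpg) + rnorm(1))
}

stopCluster(cl)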