
I am not able to set a seed value to get reproducible results from parallelSVM().

library(e1071)
library(parallelSVM)

data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

set.seed(1)
model <- parallelSVM(x, y)
parallelPredictions <- predict(model, x)

set.seed(1)
model2 <- parallelSVM(x, y)
parallelPredictions2 <- predict(model2, x)

all.equal(parallelPredictions, parallelPredictions2)

I know that this is not the right way to set a seed for multicore operations, but I have no clue what to do instead.

I know there is an option for this when using mclapply, but that does not help in my situation.
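For reference, this is roughly the mclapply-style seeding I had in mind (a sketch only; it relies on forking, so it is not available on Windows, and parallelSVM() does not use mclapply):

# Sketch: reproducible seeding with mclapply via the L'Ecuyer-CMRG generator
# (from the parallel package; forking only, so not usable on Windows)
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
res1 <- mclapply(1:4, function(i) rnorm(1), mc.cores = 2)
set.seed(1)
res2 <- mclapply(1:4, function(i) rnorm(1), mc.cores = 2)
identical(res1, res2)  # TRUE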


Edit:
I have found a workaround by modifying the trainSample() function inside parallelSVM with trace() and using the doRNG package to seed the foreach loop, roughly along the lines of the sketch below.
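Very roughly, the idea was something like this (untested sketch; whether it works depends on the exact body of parallelSVM:::trainSample and on the parallel backend already being registered when the tracer runs):

# Untested sketch of the trace()-based workaround described above
library(doRNG)
trace("trainSample",
      tracer = quote(doRNG::registerDoRNG(1)),  # fixed seed; placement is hypothetical
      where = asNamespace("parallelSVM"))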

Does anybody know a better solution?


1 Answer


In short, there is no method implemented in parallelSVM to handle this issue. However, the package uses the foreach and doParallel packages for its parallel operations, and with enough digging on Stack Overflow a solution is possible!

Credit to this answer for the usage of the doRNG package, and to this answer for giving me the idea for a simpler self-contained solution.

Solution:

In the parallelSVM package the parallelization happens through the parallelSVM::registerCores function. This function simply calls doParallel::registerDoParallel with the number of cores and no further arguments. My idea is simply to change the parallelSVM::registerCores function so that it automatically sets the seed after creating a new cluster.
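For reference, the existing function is essentially the following (it matches the unmodified branch of set.seed.parallelSWM at the end of this answer):

# The current parallelSVM::registerCores, in essence: start a cluster
# and register it with doParallel, with no seeding of any kind.
registerCores <- function(numberCores) {
  cluster <- parallel::makeCluster(numberCores)
  doParallel::registerDoParallel(cluster)
}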

When performing parallel computation in which you need a parallel seed, there are two things you need to ensure (see the small foreach sketch right after this list):

  1. The seed needs to be passed to each node in the cluster.
  2. The random number generator needs to produce streams that are statistically independent across the workers.
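
A minimal illustration of both points with foreach and doRNG (this is the mechanism the solution below wires into parallelSVM; it closely follows the doRNG documentation):

# Minimal sketch: doRNG gives every worker a reproducible, independent stream
library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

registerDoRNG(123)
r1 <- foreach(i = 1:4) %dorng% rnorm(1)
registerDoRNG(123)                    # reset the seed ...
r2 <- foreach(i = 1:4) %dorng% rnorm(1)
identical(r1, r2)                     # ... and the results repeat: TRUE

stopCluster(cl)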

Luckily, the doRNG package handles the first point and uses a generator that satisfies the second. Using a combination of unlockBinding and assign we can overwrite parallelSVM::registerCores so that it includes a call to doRNG::registerDoRNG with the appropriate seed (the function is at the end of the answer). Doing this we get proper reproducibility, as illustrated below:

library(parallelSVM)
library(e1071)
data(magicData)

set.seed.parallelSWM(1) # <=== set the seed as we normally would
# Example from help(parallelSVM)
system.time(parallelSvm1 <- parallelSVM(V11 ~ ., data = trainData[, -1],
                                        numberCores = 4, samplingSize = 0.2,
                                        probability = TRUE, gamma = 0.1, cost = 10))
system.time(parallelSvm2 <- parallelSVM(V11 ~ ., data = trainData[, -1],
                                        numberCores = 4, samplingSize = 0.2,
                                        probability = TRUE, gamma = 0.1, cost = 10))
pred1 <- predict(parallelSvm1)
pred2 <- predict(parallelSvm2)
all.equal(pred1, pred2)
[1] TRUE
identical(parallelSvm1, parallelSvm2)
[1] FALSE

Note that identical() cannot meaningfully compare the objects returned by parallelSVM(), so comparing the predictions is the better way to check whether the models are effectively the same.

For safety, let's check that this also holds for the reproducible example in the question:

x <- subset(iris, select = -Species)
y <- iris$Species

set.seed.parallelSWM(1) # <=== set the seed as we normally would (not necessary if the example above has been run)
model  <- parallelSVM(x, y)
model2 <- parallelSVM(x, y)
parallelPredictions  <- predict(model, x)
parallelPredictions2 <- predict(model2, x)
all.equal(parallelPredictions, parallelPredictions2)
[1] TRUE

Phew..

Lastly, when we are done, or if we want random seeds once again, we can reset the behaviour by executing

set.seed.parallelSWM() # <=== seed is random on each execution (the default behaviour)
# check:
model  <- parallelSVM(x, y)
model2 <- parallelSVM(x, y)
parallelPredictions  <- predict(model, x)
parallelPredictions2 <- predict(model2, x)
all.equal(parallelPredictions, parallelPredictions2)
[1] "3 string mismatches"

(the output will vary, as the RNG seed is not set)

set.seed.parallelSWM function

Credit to this answer. Note that we might not have to double up on the assignment, but here I simply replicated that answer without checking whether the code could be reduced further.

set.seed.parallelSWM <- function(seed, once = TRUE){
    if(missing(seed) || is.character(seed)){
        # No seed given: restore the original behaviour of registerCores()
        out <- function(numberCores){
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
        }
    }else{
        require("doRNG", quietly = TRUE, character.only = TRUE)
        # Seed given: register the doRNG backend with that seed right after
        # the cluster has been created and registered
        out <- function(numberCores){
            cluster <- parallel::makeCluster(numberCores)
            doParallel::registerDoParallel(cluster)
            doRNG::registerDoRNG(seed = seed, once = once)
        }
    }
    # Overwrite registerCores both in the attached package environment and
    # in the package namespace (possibly redundant, see the note above)
    unlockBinding("registerCores", as.environment("package:parallelSVM"))
    assign("registerCores", out, "package:parallelSVM")
    lockBinding("registerCores", as.environment("package:parallelSVM"))
    unlockBinding("registerCores", getNamespace("parallelSVM"))
    assign("registerCores", out, getNamespace("parallelSVM"))
    lockBinding("registerCores", getNamespace("parallelSVM"))
    invisible()
}
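
A quick usage sketch of the three modes (a hypothetical session; seed and once are the only knobs):

# Hypothetical session illustrating the three ways to call the helper
set.seed.parallelSWM(1)                 # reproducible runs (seed set once per cluster)
set.seed.parallelSWM(1, once = FALSE)   # re-seed the stream for every foreach loop
set.seed.parallelSWM()                  # back to the default, unseeded behaviour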
  • Thank you very much for your very detailed answer. This is much better than manipulating functions within the package. Very impressive. It would be nice if your function found its way into the parallelSVM package! – Mirko Sep 25 '19 at 02:06
  • Agreed, several packages could use similar tools. However, the package was last updated in 2015, so I doubt it would be implemented (maybe a future project on my end, for a package project, hmm...). – Oliver Sep 25 '19 at 02:34
  • I have a small additional question. I am running a loop that executes parallelSVMs on 180 small data sets. With every iteration a small "R for Windows front-end" process is added to my memory. Do you have an idea where this comes from and how to clear these processes after each iteration? Does parallelSVM start a new cluster for every iteration or something? – Mirko Sep 25 '19 at 03:12
  • Yes indeed. `registerCores(...)` starts a new cluster at each call. Optimally it should check for an active cluster and only start one if none exists. For now, until I get some time, adding `foreach::registerDoSEQ()` just before `cluster <- parallel::makeCluster(numberCores)` within the `out` function will close the cluster before it is reopened, closing any parallel session. Credits to the author of `doParallel` in his [answer here](https://stackoverflow.com/a/25110203/10782538). – Oliver Sep 25 '19 at 03:29
  • At least according to that answer. On testing, it seems the sessions will only be closed once the garbage collector runs (which should happen within the next couple of calls). A warning is likely produced in the meantime. – Oliver Sep 25 '19 at 03:31
  • **Deep respect** for your help @MirkoLudewig and other users of quantitatively fair sciences to achieve methods for **principally repeatable models (and responsible science at greater scale)**. Worth +100 if we could, due to **high ethical reasons for responsible science**. All the best in your professional future and let us spread the ethics of repeatable (reproducible) scientific experiments (not so generally visible in these days around the world, though possible, though necessary, though no better lege artis procedures anywhere near our sights). **[ Beliefs are simply not enough ]** – user3666197 Sep 25 '19 at 03:34
  • I am so confused @user3666197 ? My pleasure. ^^ – Oliver Sep 25 '19 at 03:38
  • @Oliver May I ask a question that I cannot decide from my level of reading of the R-package-management trick you have proposed above? Do you, be it accidentally or knowingly, set all cores' processes to have the same one seed value (which will, though legally, reduce the sought-for diversity of the PRNG-dependent generative models), or do you seed each of these processes a slightly different seed (where N0, N0+1, N0+2, N0+3, ... will suffice, due to the nature of non-correlated outputs from the at least this way non-singularly, yet reproducibly, seeded pseudorandom number generator algorithm)? – user3666197 Sep 25 '19 at 04:18
  • @user3666197, it is a great question and one I sought the answer to in `help(doRNG::registerDoRNG)` before posting my answer. The random number generator used in `doRNG::registerDoRNG` is stable across multiple streams (clusters), and can be reset for each cluster (in my function by setting `once = FALSE` when calling `set.seed.parallelSWM`). Actually there's an example in the [second reference](https://stackoverflow.com/a/43302297/10782538) which is exactly about this problem, containing an example for illustration. Basically the random number generator used is specialized for parallelism. – Oliver Sep 25 '19 at 10:09
  • (continued). So setting `once = TRUE` ensures new random numbers for each separate R session, while setting `once = FALSE` would, in my understanding from the help page, generate the same string of random numbers for each stream. – Oliver Sep 25 '19 at 10:11
  • Thx Oliver for kind directions. Unable to solve wonders from the sources: the reproducible, yet per-foreach-line-of-execution different seeding syntax was the concern (having 4 identical sequences of pRNG numbers produced in parallel (of interest for some cases) reduces the randomness of generative models, while having 4 different (still deterministically reproducible) sequences of pRNG numbers produced in parallel does not skew the sought-for randomness of generative models). Not being in the R domain, is something like this feasible: **`a1 <- foreach(i=1:4, .options.RNG=123+i) %dorng% { GeneOps(...) }`?** – user3666197 Sep 26 '19 at 03:17
  • Luckily it takes care of the multi-session RNG properly (the RNG is specifically designed for it). It is not, to my knowledge, an implemented method, yet it could be programmed to work like that. However, it would likely make more sense to just set the seed in the options and continue using `%dopar%`. – Oliver Sep 26 '19 at 10:48