I'm a beginner to parallel computing in R. I came across the doParallel package and I thought it might be useful in my case.

The following code is meant to fit several pglm regressions in parallel:

require("foreach")
require("doParallel")

resVar <- sample(1:6,100,TRUE)
x1     <- 1:100
x2     <- rnorm(100)
x3     <- rchisq(100, 2, ncp = 0)
x4     <- rweibull(100, 1, scale = 1)
Year   <- sample(2011:2014,100,replace=TRUE)
X      <- data.frame(resVar,x1,x2,x3,x4,Year)

facInt <- 1:4 # indices of the candidate factors
# find all possible combinations
cmbList <- lapply(2, function(nbFact) {
   allCmbs <- t(combn(facInt, nbFact))
   dupCmbs <- combn(1:4, nbFact, function(x) any(duplicated(x)))
   allCmbs[!dupCmbs, , drop = FALSE] })

noSubModel   <- c(0, sapply(cmbList, nrow))
noModel      <- sum(noSubModel)
combinations <- cmbList[[1]]
factors      <- X[,c("x1","x2","x3","x4")]
coeff_vars   <- matrix(colnames(factors)[combinations[1:length(combinations[,1]),]],ncol = length(combinations[1,]))

yName       <- 'resVar'
cl <- makeCluster(4)
registerDoParallel(cl)
r <- foreach(subModelInd=1:noSubModel[2], .combine=cbind) %dopar% {
     require("pglm")
     vars <- coeff_vars[subModelInd,]
     formula <- as.formula(paste('as.numeric(', yName, ')',' ~ ', paste(vars,collapse=' + ')))
     XX<-X[,c("resVar",vars,"Year")]
     ans <- pglm(formula, data = XX, family = ordinal('logit'), model = "random", method = "bfgs", print.level = 3, R = 5, index = 'Year')

      coefficients(ans)

}
stopCluster(cl)
cl <- c()

When I run the parallel version above, it fails with the following error:

Error in { : task 1 failed - "object 'XX' not found"

The same set of pglm regressions works when evaluated sequentially with %do%:

require("pglm")
r <- foreach(subModelInd=1:noSubModel[2], .combine=cbind) %do% {
     vars <- coeff_vars[subModelInd,]
     formula <- as.formula(paste('as.numeric(', yName, ')',' ~ ', paste(vars,collapse=' + ')))
     XX<-X[,c("resVar",vars,"Year")]
     ans <- pglm(formula, data = XX, family = ordinal('logit'), model = "random", method = "bfgs", print.level = 3, R = 5, index = 'Year')

     coefficients(ans)

}

Can someone please advise on how to parallelise this task correctly?

Thanks!

mike.dl
  • Where do you define object X? This assignment `XX<-X[,c("resVar",vars,"Year")]` what does it do? – Samuel Mar 13 '17 at 10:57
  • Sure, X is the source data set, defined before running the two loops, with `resVar` as the dependent variable. The loop index `subModelInd` then runs from 1 to the number of sub-models. – mike.dl Mar 13 '17 at 11:04
  • Can you provide some sample data for XX to make it a minimal reproducible example http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ? – rbm Mar 13 '17 at 11:32
  • @rbm I edited the post with the data frame – mike.dl Mar 13 '17 at 13:22
  • Sorry, but that doesn't reproduce the problem. When I ran the code, it worked and I didn't get the `object XX not found` error. – rbm Mar 13 '17 at 13:35
  • The first part (with %do%) does work. However, the second part (with %dopar%) doesn't work and yields the error message. It seems that when you parallelise, XX is not computed at run time. – mike.dl Mar 13 '17 at 13:41
  • Please paste a _single_ piece of code which has the right libraries loaded and which reproduces the issue; please test it by restarting the R session and running it in an empty environment, then it'll be possible to fix your issue. I did not get any issues when I ran your first part and then the second, so it may be something with your setup. So again, a single piece of code I can copy/paste to reproduce the issue will help. – rbm Mar 13 '17 at 13:44
  • @rbm, thanks, I followed your tips, but I still get the same error within the foreach loop. I updated my post to make it clearer. The problem is certainly within the `pglm` function, since without it the code works. – mike.dl Mar 13 '17 at 14:19
  • Btw, I just tried the same example with the `MASS::polr` function and it works perfectly. I just can't figure out why it is not working with `pglm`. A useful option could be `foreach(subModelInd=1:noSubModel[2], .combine=cbind, .packages='pglm') %dopar% {`, but I still face the same issue – mike.dl Mar 13 '17 at 16:22
  • OK - see answer – rbm Mar 13 '17 at 16:57
  • @mike.dl I think the problem is actually caused by `%dopar%` not working correctly with the pglm package. I've been doing some testing and there's something really odd going on with the call stack in the parallel process. I think this is something going wrong in the setup when packages import other packages. Assigning to pos = 1, like rbm said, is a workable hack (and he beat me to it). – Joris Meys Mar 13 '17 at 17:09

1 Answer

Yes, it does look like there is an issue with pglm and the way it accesses variables. A simple fix is to assign XX in the global environment, i.e. change

XX<-X[,c("resVar",vars,"Year")]

to

assign("XX", X[,c("resVar",vars,"Year")], pos = 1)

This should do the trick. Each cluster worker runs as a separate process (not a separate thread, as far as I know) with its own global environment, so you won't have two workers competing for the same XX variable.
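
To illustrate why the assignment helps, here is a minimal sketch that uses only the base parallel package instead of foreach/doParallel. The helper `lookup_in_global` is hypothetical: it stands in for a function that looks up its data by name in the global environment. That behaviour is an assumption made for the demo, not a description of how pglm is actually implemented.

```r
library(parallel)

# Hypothetical stand-in for a fitting function that resolves its data by name
# in the global environment instead of the caller's frame (assumption only).
lookup_in_global <- function(data_name) {
  get(data_name, envir = globalenv())
}

cl <- makeCluster(2)
# Ship the helper to the workers; envir = environment() also works when this
# script is sourced into a non-global environment.
clusterExport(cl, "lookup_in_global", envir = environment())

# XX exists only as a local variable of the task function, so the lookup in
# the worker's global environment fails; the error message is captured.
local_only <- parLapply(cl, 1:2, function(i) {
  XX <- data.frame(id = i)
  tryCatch(nrow(lookup_in_global("XX")),
           error = function(e) conditionMessage(e))
})

# assign(..., pos = 1) writes XX into the worker's own global environment,
# so the same lookup now succeeds.
with_assign <- parLapply(cl, 1:2, function(i) {
  assign("XX", data.frame(id = i), pos = 1)
  lookup_in_global("XX")$id
})

stopCluster(cl)
```

Since every worker is its own R process, the XX written by one task is never visible to a task running on another worker. It does persist on that worker between tasks, which is worth keeping in mind when reusing a cluster.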

I also added two extra lines: a set.seed(131) at the top, and a write() call after coefficients(ans) that logs each result to a file, i.e.

set.seed(131)

... rest of your code ....
coefficients(ans)

write(paste0(coefficients(ans)[1],"\n"),file="c:\\temp\\r\\out.txt",append=TRUE)

and consistently got 6 lines in the file (the same numbers each run, but obviously in a different order):

0.703727602527463
1.03799340156792
1.15220874833614
1.30381769320552
1.42656613017171
1.77287504108163

That should work for you as well.
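
One caveat about the logging line: a task returns the value of its last expression, so if a write() or cat() call comes last, the task returns NULL rather than the coefficients. The sketch below shows one way to log and still return the value. It uses base parallel's parLapply in place of foreach to stay dependency-free, and the temporary log file and the `i^2` computation are placeholders for the real path and for coefficients(ans)[1].

```r
library(parallel)

log_file <- tempfile(fileext = ".txt")  # placeholder log location

cl <- makeCluster(2)

res <- parLapply(cl, 1:6, function(i, path) {
  value <- i^2  # placeholder for coefficients(ans)[1]
  # Append one line per task. cat() returns NULL, so it must not be the
  # last expression of the task.
  cat(value, "\n", sep = "", file = path, append = TRUE)
  value         # last expression: this is what the task returns
}, path = log_file)

stopCluster(cl)

# The file holds one line per task, in whatever order the workers finished.
logged <- as.numeric(readLines(log_file))
```

The returned list `res` keeps the task order, while the lines in the log file follow completion order, which matches the "same numbers, different order" observation above.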

rbm