
Below is just a toy example to reproduce the issue; my actual data, and the functions that act on it, are much more involved and would genuinely benefit from running in parallel.

The problem I have is that the loop below produces correct results under both `%do%` and `%dopar%`, but `%dopar%` is very slow relative to `%do%`.
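A minimal way to quantify the gap (a sketch that assumes the setup code further down has already been run) is to wrap each loop in system.time() and compare the elapsed times:

t_do <- system.time(
  foreach(i = 1:3) %do% {
    dat <- myList[[which(names(myList) == nms[i])]]
    mean(dat)
  }
)
t_dopar <- system.time(
  foreach(i = 1:3) %dopar% {
    dat <- myList[[which(names(myList) == nms[i])]]
    mean(dat)
  }
)
t_do["elapsed"]; t_dopar["elapsed"]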

I have narrowed the problem down to the lookup step: on every iteration I search the names of a very large list, subset the list by the resulting index, and then operate on the extracted data.
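A rough illustration of that lookup cost (again assuming the setup below): `which(names(myList) == key)` allocates and scans a 200,000-element logical vector on every call, whereas `[[` with a character key does the name matching in C without that per-call allocation:

key <- "123456"                                            # any name present in myList
system.time(for (k in 1:100) which(names(myList) == key))  # repeated full scans
system.time(for (k in 1:100) myList[[key]])                # direct name lookup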

Can someone offer insight into how the `%dopar%` loop could be improved? In my actual data, I need to subset a data frame already stored in a list, and that data frame is then passed to four different functions.

Apologies as well for the cross-post: I did post this question on R-Help, but I see more activity regarding foreach on Stack Exchange.

# build a large named list: 200,000 elements of 100 random draws each
N <- 200000
myList <- vector('list', N)
names(myList) <- 1:N   # names are coerced to the characters "1".."200000"
for(i in 1:N){
  myList[[i]] <- rnorm(100)
}
nms <- 1:N             # keys used to look elements up inside the loop
library(foreach)
library(doParallel)
registerDoParallel(cores=7)   # register a parallel backend with 7 workers

# sequential version: runs quickly
result <- foreach(i = 1:3) %do% {
  dat <- myList[[which(names(myList) == nms[i])]]
  mean(dat)
}

# parallel version: identical code, but much slower
result <- foreach(i = 1:3) %dopar% {
  dat <- myList[[which(names(myList) == nms[i])]]
  mean(dat)
}
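I suspect something along the following lines might help, but I'm not sure (a sketch using the same toy data as above): (a) replace the `which()` scan with direct name-based indexing, since the names were coerced to character when they were assigned; and (b) iterate over the selected elements themselves, so each worker receives only the data it needs instead of a copy of the entire list:

# (a) direct name-based indexing instead of scanning all names
result <- foreach(i = 1:3) %dopar% {
  dat <- myList[[as.character(nms[i])]]
  mean(dat)
}

# (b) iterate over the elements themselves; foreach ships each worker
#     only its own items rather than the whole 200,000-element list
result <- foreach(dat = myList[as.character(nms[1:3])]) %dopar% {
  mean(dat)
}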
  • Do not expect it to be nbCores times quicker; there is some work to be done to send the data to the workers and to gather the results back when they finish. I advise you to read an introduction like http://shop.oreilly.com/product/0636920021421.do – statquant Dec 04 '16 at 13:15
  • This looks like an overhead problem. You're only iterating 3 times, and each iteration is pretty fast. So the benefit of multi-threading is outweighed by the cost (overhead to split, apply, and recombine). Try changing `1:3` to `1:3000`, and you'll start to see the benefits of multi-threading. – rosscova Dec 04 '16 at 13:18
  • Also just a note, in your example, `which(names(myList) == nms[i])` is the same as `i`, since `names(myList)` and `nms` are identical. – rosscova Dec 04 '16 at 13:20
  • Thanks for the reply. Yes, in the toy example I'm only iterating 3 times. But in the actual problem I iterate over all 200,000 list elements. Even in the real problem, `%do%` finishes in roughly 1 hr, whereas `%dopar%` never finishes, or at least doesn't finish within many hours, so I stop the process. – dhc Dec 04 '16 at 13:21
  • Thanks rosscova. Again, this is just a toy example. In the actual data, `nms[i]` and `i` are not the same, so the code must run differently. – dhc Dec 04 '16 at 13:22
  • This is surely a duplicate – statquant Dec 04 '16 at 13:22
  • Possible duplicate of [Why is the parallel package slower than just using apply?](http://stackoverflow.com/questions/14614306/why-is-the-parallel-package-slower-than-just-using-apply) – statquant Dec 04 '16 at 13:25
  • Iterating over all 200,000 list items takes about 190s using `%dopar%` on my machine. Maybe check your memory usage during the run. If you over-fill the RAM, your process will slow WAY down. – rosscova Dec 04 '16 at 13:29
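Following up on the overhead point raised in these comments, a minimal chunking sketch (it assumes the itertools package is available): hand each of the 7 workers one large task instead of 200,000 tiny ones, so the per-task serialization overhead is paid only a handful of times:

library(itertools)

# split the list itself into 7 chunks; each worker receives one chunk and
# returns a named vector of means, concatenated by .combine = c
result <- foreach(chunk = isplitVector(myList, chunks = 7),
                  .combine = c) %dopar% {
  vapply(chunk, mean, numeric(1))
}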
