When I use mclapply, it occasionally (seemingly at random) returns incorrect results. The problem is quite thoroughly described in other posts across the Internet, e.g. http://r.789695.n4.nabble.com/Bug-in-mclapply-td4652743.html. However, no solution is provided. Does anyone know how to fix this problem? Thank you!
-
mclapply returns NULL when one of the forked processes takes a long time to return; I expect there is some kind of built-in timeout that causes the process to die after a certain amount of time, but I cannot find it anywhere in the source code. – James King Jan 05 '14 at 06:38
-
Another link where users report this problem: http://r.789695.n4.nabble.com/Problem-with-mclapply-losing-output-data-td3395746.html – James King Jan 05 '14 at 16:07
-
Kudzu, Steve Weston suggested that the out-of-memory killer might be terminating mclapply processes, which turned out to be the cause of the NULLs in my case. Can you verify that oom_killer is causing your NULLs as well? – James King Jan 09 '14 at 00:21
2 Answers
The problem reported by Winston Chang that you cite appears to have been fixed in R 2.15.3. There was a bug in mccollect that occurred when assigning the worker results to the result list:
if (is.raw(r)) res[[which(pid == pids)]] <- unserialize(r)
This fails if unserialize(r) returns NULL, since assigning NULL to a list element in this way deletes that element of the list. This was changed in R 2.15.3 to:
if (is.raw(r)) # unserialize(r) might be null
res[which(pid == pids)] <- list(unserialize(r))
which is a safe way to assign an unknown value to a list.
So if you're using R <= 2.15.2, the solution is to upgrade to R >= 2.15.3. If you have a problem using R >= 2.15.3, then presumably it's a different problem than the one reported by Winston Chang.
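If you're unsure which side of that boundary your installation falls on, a quick check with base R's getRversion() (nothing beyond base R assumed) is:

```r
# Check whether the running R predates the mccollect fix,
# which landed in R 2.15.3.
if (getRversion() < "2.15.3") {
  message("R ", getRversion(),
          " predates the mclapply/mccollect NULL fix; consider upgrading")
} else {
  message("R ", getRversion(), " already includes the fix")
}
```

getRversion() returns an R_system_version object, which compares correctly against version strings like "2.15.3" (a plain string comparison would not).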
I also read over the issues discussed in the R-help thread started by Elizabeth Purdom. Without a specific test case, my guess is that the problem is not due to a bug in mclapply because I can reproduce the same symptoms with the following function:
work <- function(i, poison) {
  if (i == poison) quit(save='no')
  i
}
If a worker started by mclapply dies while executing a task for any reason (receiving a signal, seg faulting, exiting), mclapply will return a NULL for all of the tasks that were assigned to that worker:
> library(parallel)
> mclapply(1:4, work, 3, mc.cores=2)
[[1]]
NULL
[[2]]
[1] 2
[[3]]
NULL
[[4]]
[1] 4
In this case, NULLs were returned for tasks 1 and 3 due to prescheduling, even though only task 3 actually failed.
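One way to see the prescheduling effect is to disable it. The sketch below reuses the poison-pill work() function from above (forking is unavailable on Windows, so this assumes Linux or macOS); with mc.preschedule=FALSE each element gets its own forked process, so the NULL should be confined to the task that actually died:

```r
library(parallel)

# Same poison-pill worker as above: kills its own forked process
# when it receives the poison value.
work <- function(i, poison) {
  if (i == poison) quit(save = "no")
  i
}

# With prescheduling disabled, each task runs in its own fork,
# so only the task that actually died comes back as NULL.
res <- mclapply(1:4, work, 3, mc.cores = 2, mc.preschedule = FALSE)
which(vapply(res, is.null, logical(1)))
```

With the default mc.preschedule=TRUE, tasks are dealt out to the workers round-robin before any work starts, which is why a single dead worker takes its whole batch of results down with it.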
If a worker dies when using a function such as parLapply or clusterApply, an error is reported:
> cl <- makePSOCKcluster(3)
> parLapply(cl, 1:4, work, 3)
Error in unserialize(node$con) : error reading from connection
I've seen many such reports, and I think they tend to happen in large programs that use lots of packages that are hard to turn into reproducible test cases.
Of course, in this example, you'll also get an error when using lapply, although the error won't be hidden as it is with mclapply. If the problem doesn't seem to happen when using lapply, it may be because the problem rarely occurs, so it only happens in very large runs that are executed in parallel using mclapply. But it is also possible that the error occurs, not because the tasks are executed in parallel, but because they are executed by forked processes. For example, various graphics operations will fail when executed in a forked process.
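Since mclapply stays silent when a worker dies, a defensive wrapper can at least surface the NULLs. The safe_mclapply below is purely illustrative (it is not part of the parallel package or any other package), and it cannot distinguish a killed worker from a function that legitimately returned NULL:

```r
library(parallel)

# Illustrative wrapper: run mclapply, then warn if any element of the
# result is NULL, which can mean a worker died (signal, segfault,
# oom killer, ...). Caveat: a FUN that legitimately returns NULL
# triggers the same warning.
safe_mclapply <- function(X, FUN, ..., mc.cores = 2L) {
  res <- mclapply(X, FUN, ..., mc.cores = mc.cores)
  bad <- which(vapply(res, is.null, logical(1)))
  if (length(bad) > 0)
    warning("NULL results for elements ", paste(bad, collapse = ", "),
            "; a worker process may have been killed")
  res
}
```

Usage is the same as mclapply, e.g. safe_mclapply(1:4, sqrt), with a warning raised whenever NULLs appear in the result.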

-
I am seeing the problem on R 3.0.1, so I think it is something else. Also it's not a matter of the function returning null and deleting an element of the list. Rather the function returns a non-null value when running with lapply, but when run in parallel the resulting list contains nulls. I'm working on an example so others can reproduce the issue. – James King Jan 07 '14 at 04:25
-
I am using R on a school cluster, which is indeed running version 2.15.2. I will try to see if it will be solved with a newer version. Thank you! – Kudzu Jan 07 '14 at 05:28
-
This is helpful. The function I am calling should not a priori fail when run in parallel, but I wonder if parallel execution on 16 cores is constraining resources to the point where the function sometimes doesn't work. I will have some time tomorrow and try to get an example posted (which will unfortunately be a large run with a lot of data). – James King Jan 08 '14 at 04:46
-
@user3114046 On a Linux system, if memory gets low due to many workers the out-of-memory killer might kill some of the workers, resulting in mclapply returning NULLs. That happens on our clusters where we have limited swap space. – Steve Weston Jan 08 '14 at 13:10
-
@Steve Weston Awesome! R invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0. There looks to be plenty of memory available but perhaps the processes are caged in some way. So I wonder if there is a way for mclapply to inform the user if a process was killed? I've examined the code but it's not clear to me how this could be done. – James King Jan 09 '14 at 00:14
-
@user3114046 I'm a bit surprised that it issues no warning or message of any kind in this case. It would be a very reasonable feature request to submit to R core. – Steve Weston Jan 09 '14 at 16:53
-
I may submit a request. The same thing happens if I manually kill one of the workers - nulls in the answer with no warning that a process was killed. – James King Jan 09 '14 at 17:22
I'm adding this answer so others hitting this question won't have to wade through the long thread of comments (I am the bounty granter but not the OP).
mclapply initially populates the list it creates with NULLs. As the worker processes return values, these values overwrite the NULLs. If a process dies without ever returning a value, mclapply will return a NULL for that element.
When memory becomes low, the Linux out-of-memory killer (oom killer, https://lwn.net/Articles/317814/) will start silently killing processes. It does not print anything to the console to let you know what it's doing, although the oom killer's activities show up in the system log. In this situation the output of mclapply will appear to have been randomly contaminated with NULLs.
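To verify the oom killer hypothesis on a given machine, the kernel log is the place to look. The sketch below is one way to do that from within R (Linux only; it assumes dmesg is on the PATH and readable by the current user, and log locations vary by distribution):

```r
# Search the kernel ring buffer for oom-killer activity. The grep
# exits nonzero when nothing matches, so wrap the call to get an
# empty character vector instead of a warning/error in that case.
oom_msgs <- tryCatch(
  system("dmesg | grep -i -E 'oom|killed process'", intern = TRUE),
  warning = function(w) character(0),
  error   = function(e) character(0))
if (length(oom_msgs) > 0) cat(oom_msgs, sep = "\n")
```

On systemd-based distributions, journalctl -k is another place the same messages appear; on others, /var/log/syslog or /var/log/messages.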

-
Exactly what I am dealing with now...what did you do to stop the oom killer? – gannawag Apr 20 '18 at 14:44
-
@gannawag I ran fewer processes so that the memory did not get exhausted. – James King Apr 21 '18 at 23:23