I am using foreach::foreach() in R to run an analysis in parallel. I am using a computing cluster with 1 node, 500 GB of RAM and 30 cores. I initialize the cluster using:
myCluster <- parallel::makeCluster(28)
doParallel::registerDoParallel(myCluster)
The process runs to completion in around 8 hours; however, the foreach loop does not combine the results and returns a NULL object (lcp_network). The loop code looks like this (not a reprex):
lcp_network <- foreach::foreach(
    i = 1:nrow(comps),
    .errorhandling = "remove",
    .combine = "rbind",
    .packages = c("sf", "terra", "leastcostpath", "dplyr")
) %dopar% {
    lcp <- leastcostpath::create_lcp(cost_surface = tr1,
                                     origin = nodes_sp[comps[i, 1], , drop = FALSE],
                                     destination = nodes_sp[comps[i, 2], , drop = FALSE])
    lcp$origin_ID <- nodes_sp[comps[i, 1], ]$layer
    lcp$destination_ID <- nodes_sp[comps[i, 2], ]$layer
    lcp <- lcp %>%
        st_as_sf() %>%
        mutate(length = st_length(.)) %>%
        st_drop_geometry()
    attributes(lcp$length) <- NULL
    return(lcp)
}
Notably, this same code runs on a smaller subset of the data on my personal computer (8 GB of RAM, 8 cores) and combines the results with no problem. The output given when using the .verbose argument is:
numValues: 43, numResults: 0, stopped: TRUE
got results for task 1
accumulate got an error result
numValues: 43, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 2
...
returning status FALSE
got results for task 43
numValues: 43, numResults: 43, stopped: TRUE
not calling combine function due to errors
returning status TRUE
Any advice is helpful. I have tried adding gc() within the loop, among other attempted fixes.
EDIT: I noticed that for the first task the verbose output reports:
accumulate got an error result
and at every point thereafter, it notes:
returning status FALSE
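That pattern is consistent with tasks raising errors on the workers: with .errorhandling = "remove", failed tasks are dropped, so if every task fails there is nothing left to combine. A minimal sketch of how the underlying error messages could be surfaced is below. It uses a hypothetical toy_task() in place of the real create_lcp() call and %do% instead of %dopar% so that it runs without a cluster; the names toy_task, res, is_err, err_msgs and ok are illustrative only.

```r
library(foreach)

# Hypothetical stand-in for the real per-pair computation; it fails on purpose
# for even inputs so that there is something to inspect.
toy_task <- function(i) {
  if (i %% 2 == 0) stop("task ", i, " failed")
  data.frame(id = i, value = i^2)
}

# "pass" keeps error objects in the result list instead of dropping them,
# and omitting .combine leaves the results as a plain list.
res <- foreach(i = 1:4, .errorhandling = "pass") %do% toy_task(i)

# Separate failures from successes and look at the messages.
is_err <- vapply(res, inherits, logical(1), what = "error")
err_msgs <- vapply(res[is_err], conditionMessage, character(1))
print(err_msgs)

# Combine only the successful results.
ok <- do.call(rbind, res[!is_err])
```

Running the real loop this way on a handful of rows of comps, rather than all of them, would show whether the workers fail with the same message each time.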
EDIT 2: I ran the same code on a different server with the same resources (500 GB of RAM, 30 cores). The error output is different now:
numValues: 43, numResults: 0, stopped: TRUE
Error in unserialize(socklist[[n]]) : error reading from connection
Calls: %dopar% ... recvOneData -> recvOneData.SOCKcluster -> unserialize
Execution halted
slurmstepd: error: Detected 8 oom-kill event(s) in StepId=13251537.batch.
Some of your processes may have been killed by the cgroup out-of-memory handler.
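The oom-kill lines indicate that worker processes were killed for exceeding the job's memory limit. Each worker created by makeCluster() is a separate R process with its own copy of the exported objects, so total memory grows with the number of workers. A rough sketch of sizing the worker count from a memory budget follows; the per-worker figure is an assumed placeholder, not a measurement, and the Slurm cgroup limit may be lower than the node's physical RAM.

```r
# Hypothetical numbers: adjust to the actual Slurm allocation and to the
# measured footprint of one worker (for example from a single-task test run).
mem_budget_gb     <- 500   # memory available to the job (from the post)
cores_available   <- 30    # cores on the node (from the post)
mem_per_worker_gb <- 40    # ASSUMED peak memory per worker, not measured

# Use as many workers as fit in memory, leaving two cores free as before.
n_workers <- max(1, min(cores_available - 2,
                        floor(mem_budget_gb / mem_per_worker_gb)))
print(n_workers)

# The cluster would then be created with the smaller count:
# myCluster <- parallel::makeCluster(n_workers)
# doParallel::registerDoParallel(myCluster)
```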