I am using the doParallel package to parallelize jobs across multiple Linux machines, using the following syntax:
cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = T))
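For context, the rest of the setup looks roughly like the sketch below; the host list, n_sims, and run_one_simulation are placeholders standing in for my actual machine list and simulation code:

library(doParallel)  # also attaches foreach and parallel

# Placeholder host list and simulation; the real run uses 10 machines and
# thousands of small, self-contained simulations.
machines <- rep(c("node01", "node02"), each = 2)
n_sims   <- 1000
run_one_simulation <- function(i) data.frame(id = i, value = rnorm(1))

cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = T))
registerDoParallel(cl)

results <- foreach(i = seq_len(n_sims), .combine = rbind) %dopar% {
  run_one_simulation(i)
}

stopCluster(cl)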
Typically each job would take less than 10 minutes to run on a single machine. However, sometimes one worker process would 'run away' and keep running for hours, never returning to the main driver process. I can see the process running with top, but it seems to be stuck rather than doing any real work. The outfile = '' option doesn't produce anything useful, since the worker process never actually fails.
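One thing I am planning to try, in case it helps narrow things down, is timestamped cat() logging inside each task; with outfile = '' the workers' output should show up on the master's console, so the last "started" line without a matching "finished" line would at least identify the stuck task. A rough sketch (run_one_simulation is again a placeholder):

results <- foreach(i = seq_len(n_sims), .combine = rbind) %dopar% {
  host <- Sys.info()[["nodename"]]
  cat(sprintf("[%s] %s pid %d: task %d started\n", Sys.time(), host, Sys.getpid(), i))
  res <- run_one_simulation(i)
  cat(sprintf("[%s] %s pid %d: task %d finished\n", Sys.time(), host, Sys.getpid(), i))
  res
}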
This happens rather frequently but very randomly. Sometimes I can simply restart the jobs and they finish fine, so I cannot provide a reproducible example. Does anyone have general suggestions on how to investigate this issue, or what to look for when it happens again in the future?
EDIT:
Adding more details in response to the comments. I am running thousands of small simulations across 10 machines; I/O and memory usage are both minimal. I have noticed worker processes running away on different machines at random, with no apparent pattern and not necessarily on the busiest ones. I don't have permission to view the system logs, but based on the CPU/RAM history there doesn't seem to be anything unusual.
It happens frequently enough that it's fairly easy to catch a runaway process in action. top shows the process running at close to 100% CPU with status R, so it is definitely running and not waiting. But I am also quite sure that each simulation should take only minutes, and somehow the runaway worker just keeps running non-stop.
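To at least contain the problem while I investigate, I am also considering a per-task elapsed-time limit via base R's setTimeLimit(), so a task that runs far beyond the expected few minutes errors out instead of hanging the whole run. A sketch, with an arbitrary 30-minute limit; note that the limit is only checked at interrupt points, so it would not stop a task stuck inside compiled code:

results <- foreach(i = seq_len(n_sims)) %dopar% {
  # Abort any single simulation whose elapsed time exceeds 30 minutes;
  # transient = TRUE clears the limit once the expression returns.
  setTimeLimit(elapsed = 1800, transient = TRUE)
  tryCatch(run_one_simulation(i),
           error = function(e) conditionMessage(e))
}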
So far doParallel is the only package I have tried. I am exploring other options, but it's hard to make an informed decision without knowing the cause.