
I am running R code on a Linux cluster. The code is complex (over two thousand lines), involves over 40 R packages and several hundred variables. However, it does run on both the Windows and Linux versions of R.

I am now running the code on the Edinburgh University EDCF high-performance computing cluster, and the code is run in parallel. The parallel code is called from within DEoptim, which, after some initialization, runs a series of functions in parallel; the results are sent back to the DEoptim algorithm as well as being saved as a plot and a data table in my own space - and, importantly, the code runs and works!

The code models the hydrology of a region, and I can set it to simulate historic conditions over any time period I want - from one day to 30 years. For a one-month simulation run in parallel, results are spat out approximately every 70 seconds, and the DEoptim algorithm simply keeps re-running the code with different input parameters, trying to find the best parameter set.
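
For context, the overall shape is roughly this (a heavily stripped-down sketch, not my actual code; run_one_subcatchment, subcatchments, obs_flow, the package names and the parameter bounds are all placeholders):

library(parallel)
library(DEoptim)

# set up the workers once, up front
cl <- makeCluster(8)
clusterEvalQ(cl, { library(zoo); library(hydroGOF) })   # placeholder packages
clusterExport(cl, c("run_one_subcatchment"))            # placeholder model function

# objective function that DEoptim minimises: it farms the model runs out to the
# workers, gathers the simulated flows and returns one goodness-of-fit value
hydro_objective <- function(params) {
  sims <- parLapply(cl, subcatchments,
                    function(sc) run_one_subcatchment(sc, params))
  sum((unlist(sims) - obs_flow)^2)   # obs_flow is observed flow, defined elsewhere
}

out <- DEoptim(fn = hydro_objective,
               lower = rep(0, 10), upper = rep(1, 10),   # illustrative bounds
               control = DEoptim.control(itermax = 200))

stopCluster(cl)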

The code seems to run fine for a number of runs but eventually crashes. Last night it completed over 100 runs with no problem over approximately 2 hours, but then crashed - and it always eventually crashes - with the error:

Error in unserialize(node$con) : error reading from connection

The system I am logging onto is a 16-core server (16 physical cores), according to:

detectCores()

and I requested 8 slots of 2 GB memory each. I have tried running this on a 24-core machine with a large memory request (4 slots of 40 GB each), but it still eventually crashes. The same code ran fine for several weeks on a Windows machine, running in parallel across 8 logical cores and spitting out thousands of results.

So I believe the code is okay, but why is it crashing? Could it be a memory issue? Each time the sequence is called it includes:

rm(list=ls())
gc()

Or is it simply a core crashing? At one point I thought it could be a problem if two cores were trying to write to the same data file at the same time, but I removed this temporarily and it still crashed. Sometimes it crashes after a few minutes, other times after a couple of hours. I have tried removing one core from the parallel code using:

cl <- parallel::makeCluster(parallel::detectCores()-1)

but it still crashed.

Is there any way the code could be modified so that it rejects crashed outputs, e.g. if there is an error then reject that result and carry on?

Or is there a way of modifying the code to find out why the error happened at all?

I know there are lots of other posts about serialize(node$con) and unserialize(node$con) errors, but they don't seem to help me.

I'd really appreciate some help.

Thanks.

AntonyDW
  • You could try wrapping the offending piece of code into `tryCatch` and output the result and objects to a text/.RData file. – Roman Luštrik Nov 14 '15 at 15:31
  • I'd like to know how to use tryCatch - sounds a possibility. – AntonyDW Nov 16 '15 at 09:25
  • I've created another post to try and understand how to use tryCatch with my code http://stackoverflow.com/questions/33733102/how-to-use-trycatch-in-r-with-parallel-code – AntonyDW Nov 16 '15 at 10:42
  • 1
    @SteveWeston's [answer](http://stackoverflow.com/questions/16572544/error-handling-within-parapply-in-r-using-parallel-package/16576397#16576397) suggests that the problem is that a worker _quits_, so that tryCatch() does not help. It might quit because of a programming error in the package you are using, or a resource limitation (e.g., memory use) on one of the nodes in the analysis, or some other reason. Probably you should try to figure this out by simplifying the problem, perhaps as suggested in the updated question Steve responds to. – Martin Morgan Nov 16 '15 at 12:26
  • I believe it is a resource issue. I'm going to try an rm(list=ls())-type approach at the beginning of each run. I just need to find out how that works in parallel - I don't want to delete all variables on all workers, just those loaded onto a single worker at the beginning of each run. – AntonyDW Nov 16 '15 at 19:11
  • 1
    I managed to fix this by adding type = "FORKS" when creating the cluster and stayed clear of rm() type commands. – AntonyDW Nov 23 '15 at 15:38
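
A minimal sketch of that fix (the worker count is illustrative, and FORK clusters are only available on Unix-alikes such as this Linux cluster):

library(parallel)

# FORK workers inherit the master session's loaded packages and objects via
# copy-on-write, so nothing needs to be exported to them explicitly
cl <- makeCluster(8, type = "FORK")

# ... run the DEoptim / parLapply work as before ...

stopCluster(cl)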

1 Answer


I had a similar problem running parallel code that depended on several other packages. Try using foreach() with %dopar% and specify the packages your code depends on with the .packages argument so that they are loaded onto each worker. Alternatively, judicious use of require() within the parallel code may also work.
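
For example, something along these lines (the package names, the number of tasks and the body of the loop are placeholders for whatever your code actually needs):

library(foreach)
library(doParallel)

cl <- parallel::makeCluster(8)
registerDoParallel(cl)

# .packages loads the listed packages on each worker before the body runs;
# .errorhandling = "pass" lets the loop carry on if one task throws an error
results <- foreach(i = 1:100,
                   .packages = c("zoo", "hydroGOF"),
                   .errorhandling = "pass") %dopar% {
  run_one_simulation(i)   # placeholder for the real per-task function
}

parallel::stopCluster(cl)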

caewok