
I am loading the following packages into R:

library(foreach)
library(doParallel)
library(iterators)

I have been parallelizing code for a long time, but lately I am getting intermittent stops while the code is running. The error is:

Error in serialize(data, node$con) : error writing to connection

My educated guess is that the connection I open using the commands below has expired:

## Register Cluster
##
cores <- 8
cl <- makeCluster(cores)
registerDoParallel(cl)

Looking at the makeCluster man page I see that, by default, connections expire only after 30 days! I could set options(error=recover) to check, on the fly, whether the connection is still open when the code halts, but I decided to post this general question first.

IMPORTANT:

1) The error is really intermittent; sometimes I re-run the same code and get no errors.

2) I run everything on the same multi-core machine (Intel, 8 cores), so it is not a communication (network) problem among cluster nodes.

3) I am a heavy user of CPU and GPU parallelization, on my laptop and on a desktop (64 cores). This is the first time I am getting this type of error.

Is anybody having the same type of error?

As requested I am providing my sessionInfo():

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TTR_0.22-0       xts_0.9-3        doParallel_1.0.1 iterators_1.0.6  foreach_1.4.0    zoo_1.7-9        Revobase_6.2.0   RevoMods_6.2.0  

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.3 grid_2.15.3     lattice_0.20-13 tools_2.15.3   

@SteveWeston, below is the error from one of the calls (again, it is intermittent):

starting worker pid=8808 on localhost:10187 at 15:21:52.232
starting worker pid=5492 on localhost:10187 at 15:21:53.624
starting worker pid=8804 on localhost:10187 at 15:21:54.997
starting worker pid=8540 on localhost:10187 at 15:21:56.360
starting worker pid=6308 on localhost:10187 at 15:21:57.721
starting worker pid=8164 on localhost:10187 at 15:21:59.137
starting worker pid=8064 on localhost:10187 at 15:22:00.491
starting worker pid=8528 on localhost:10187 at 15:22:01.855
Error in unserialize(node$con) : 
  ReadItem: unknown type 0, perhaps written by later version of R
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

Adding a bit more information. I set options(error=recover) and it provided the following information:

Error in serialize(data, node$con) : error writing to connection

Enter a frame number, or 0 to exit   

1: #51: parallelize(FUN = "ensemble.prism", arg = list(prism = iis.long, instances = oos.instances), vectorize.arg = c("prism", "instances"), cores = cores, .export 
2: parallelize.R#58: foreach.bind(idx = i) %dopar% pFUN(idx)
3: e$fun(obj, substitute(ex), parent.frame(), e$data)
4: clusterCall(cl, workerInit, c.expr, exportenv, obj$packages)
5: sendCall(cl[[i]], fun, list(...))
6: postNode(con, "EXEC", list(fun = fun, args = args, return = return, tag = tag))
7: sendData(con, list(type = type, data = value, tag = tag))
8: sendData.SOCKnode(con, list(type = type, data = value, tag = tag))
9: serialize(data, node$con)

Selection: 9

I tried to check if the connections were still available, and they are:

Browse[1]> showConnections()
   description                class      mode  text     isopen   can read can write
3  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
4  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
5  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
6  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
7  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
8  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
9  "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
10 "<-www.007guard.com:10187" "sockconn" "a+b" "binary" "opened" "yes"    "yes"    
Browse[1]> 

Since the connections are open, and the "unknown type 0" error suggests an R version mismatch (as pointed out by @SteveWeston) that cannot apply on a single machine, I really can't figure out what is happening here.

EDIT 1:

MY WORKAROUND TO THE PROBLEM

The code is fine in terms of the arguments passed to the function, so the answer provided by @MichaelFilosi didn't bring much to the table. In any case, many thanks for your answer!

I couldn't find exactly what was wrong with the call, but, at least, I could work around the problem.

The trick was to break the arguments of the function call, for each parallel worker, into smaller blocks.

Magically the error disappeared.

Let me know if the same worked for you!
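To illustrate the idea: something along these lines, where the input vector and the per-item work are placeholders for my real arguments and function (a sketch, not my actual code):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

# Instead of shipping all the work to the workers in one big %dopar%
# call, split the inputs into smaller blocks and loop over the blocks.
items  <- 1:10000
blocks <- split(items, ceiling(seq_along(items) / 1000))  # blocks of 1000

results <- vector("list", length(blocks))
for (b in seq_along(blocks)) {
  results[[b]] <- foreach(x = blocks[[b]], .combine = c) %dopar% {
    sqrt(x)  # placeholder for the real per-item work
  }
}
res <- unlist(results)

stopCluster(cl)
```

Each %dopar% call then serializes only one block's worth of arguments at a time, which is what made the error disappear for me.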

Marcelo Sardelich
    Can you provide your sessionInfo() – Tommy Levi Jun 10 '13 at 03:20
    Try using the makeCluster `outfile=''` option so you will see any error messages that may be issued by the workers when this happens. You can specify a filename with outfile if you're using Rgui on Windows which I believe doesn't support `outfile=''`. – Steve Weston Jun 10 '13 at 22:02
  • @TommyLevi, I just attached it! thanks for your prompt reply. – Marcelo Sardelich Jun 11 '13 at 00:10
  • @SteveWeston, I will implement it. In your opinion what is occasioning the intermittent error? – Marcelo Sardelich Jun 11 '13 at 00:20
  • @SteveWeston, I added outfile description in the post. – Marcelo Sardelich Jun 11 '13 at 16:02
    The "unknown type 0" message is very puzzling. Since you're running on a single machine, it doesn't seem possible that some worker is using a different versions of R. That suggests that the socket connection is getting corrupted, as if some part of your program is accidentally writing data to it, but that doesn't seem likely, and would be difficult to track down. – Steve Weston Jun 11 '13 at 17:52
    That can't be your full sessionInfo(), I wanted the versions of the packages you have running when the errors happen, i.e. foreach, etc. – Tommy Levi Jun 11 '13 at 19:10
  • @TommyLevi, I updated the post with the correct information. – Marcelo Sardelich Jun 11 '13 at 21:53
    @ChuckyKillerDoll did you ever find the cause? I'm receiving similar intermittent errors – Ian Fellows Feb 27 '14 at 03:27
  • The code was fine in terms of the arguments passed, so the answer from @MichaelFilosi didn't bring much to the table. I couldn't find what was wrong with the call, but, at least, I could work around the problem. The trick was to break the function call into blocks and iterate over them in a loop... something similar. Let me know if it works for you. – Marcelo Sardelich Feb 28 '14 at 04:52
  • @IanFellows, any luck? – Marcelo Sardelich Mar 21 '14 at 01:25
    can you please explain by an example what you mean by "to break the arguments of function call, for each parallel thread, into smaller blocks"? Would be grateful – Agile Bean Nov 13 '19 at 14:50
  • I have the same problem, did you find the solution ? – AmData Feb 20 '20 at 13:04

5 Answers


This is most likely due to running out of memory (see my blog post for details). Here's an example of how you can cause this error:

> a <- matrix(1, ncol=10^4*2.1, nrow=10^4)
> cl <- makeCluster(8, type = "FORK")
> parSapply(cl, 1:8, function(x) {
+   b <- a + 1
+   mean(b)
+   })
Error in unserialize(node$con) : error reading from connection
Max Gordon
    I also think that this is due to memory issues. I solved it by creating smaller threads requiring less memory in my problem – mondano Feb 05 '16 at 10:46
  • I just got this problem and thought it was a memory issue, but when I ran the involved code by hand rather than via Rscript, the problem did not appear. Also, I did not have this problem on a different (more powerful) computer. – cgmil Sep 14 '18 at 22:50

I struggled with this problem for quite a while and was able to fix it by moving all my required packages into the arguments of the foreach loop, using `.packages=c("ex1","ex2")`. Previously I had just used require("ex1") inside the loop, and this seems to have been the root cause of my errors.

Overall, I would just make sure you're moving everything possible into the foreach arguments to avoid these types of errors.
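For example, the difference looks roughly like this (the use of the zoo package here is just an illustration; substitute whatever packages your loop body actually needs):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# Packages are attached on each worker via the .packages argument,
# rather than by calling require()/library() inside the loop body.
res <- foreach(i = 1:10, .combine = c, .packages = c("zoo")) %dopar% {
  as.numeric(coredata(zoo(i)))  # placeholder work using the attached package
}

stopCluster(cl)
```

With `.packages`, foreach attaches the packages on every worker before the loop body runs, so the body never has to load anything itself.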


I got a similar error: Error in unserialize(node$con) : error reading from connection

I found out it was a missing argument in a call to a C function through .Call(). Maybe it can be of help!

  • @Filosi Thanks for sharing! I solved the problem by calling the parallelization routine within a loop. – Marcelo Sardelich Oct 04 '13 at 02:32
    @Filosi Can you provide more details to what solved your problem? For instance, which argument was missing; which line of code, ... etc. Cheers –  Aug 19 '14 at 19:03

I am having the same issue, and I doubt it is memory-related. My code is as simple as:

library(doParallel)
library(foreach)
cl <- makeCluster(2, outfile='LOG.TXT')
registerDoParallel(cl)
res <- foreach(x=1:10) %dopar% x

and I got the following error message in LOG.TXT:

starting worker pid=13384 on localhost:11776 at 18:25:29.873
starting worker pid=21668 on localhost:11776 at 18:25:30.266
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

The program works anyway, so I just ignored it for now. However, I always feel uncomfortable seeing those errors in the log file.
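One thing that may be worth checking (a guess, not a confirmed diagnosis): whether the session ends without the cluster being shut down. A clean shutdown looks like this:

```r
library(doParallel)
library(foreach)

cl <- makeCluster(2, outfile = 'LOG.TXT')
registerDoParallel(cl)

res <- foreach(x = 1:10) %dopar% x

# Shut the workers down explicitly before the R session ends, so they
# exit cleanly instead of failing on a dead socket connection.
stopCluster(cl)
```

If the errors in LOG.TXT only appear when R quits with the cluster still running, they are likely the workers complaining about the lost connection rather than a problem with the computation itself.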

Yue Zhao
  • I can only reproduce this if I quit R abruptly without first stopping the cluster using `stopCluster(cl)`. This also happens when using the parallel package only, e.g. `library("parallel"); cl <- makeCluster(1L, outfile = 'log.out'); parLapply(cl, X = 1, fun = function(x) x); quit("no")`. But if you add `stopCluster(cl)`, the error is not there. – HenrikB Apr 11 '17 at 04:42

I had the same error using the foreach with a doSNOW backend.

I received the same error as the OP after a timeout, but when running the task without foreach, no error was returned.

Apparently, the task manager can kill processes due to a multitude of reasons, not only lack of memory.

In my particular case, it seems the problem was core temperature. Reducing the number of CPU cores and adding a Sys.sleep() call made the system run cooler, and the error stopped appearing.

It may be worth a try.
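A rough sketch of what I mean (the worker count, block size, and sleep duration are arbitrary values to illustrate the idea, and the per-item work is a placeholder):

```r
library(foreach)
library(doSNOW)

# Use fewer workers than physical cores, and pause between blocks of
# work, to keep CPU load (and temperature) down.
cl <- makeCluster(4)  # e.g. 4 workers on an 8-core machine
registerDoSNOW(cl)

items  <- 1:1000
blocks <- split(items, ceiling(seq_along(items) / 100))

res <- vector("list", length(blocks))
for (b in seq_along(blocks)) {
  res[[b]] <- foreach(x = blocks[[b]], .combine = c) %dopar% sqrt(x)
  Sys.sleep(1)  # let the machine cool off between blocks
}

stopCluster(cl)
```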

Elijah