
I'm running a parallel operation using a SOCK cluster with workers on the local machine. If I limit the set I'm iterating over (in one test, 70 instead of the full 135 tasks), then everything works just fine. If I go for the full set, I get the error "Error in unserialize(socklist[[n]]) : error reading from connection".

  • I've unblocked the port in Windows Firewall (both inbound and outbound) and allowed full access for Rscript/R.

  • It can't be a timeout issue because the socket timeout is set to 365 days.

  • It's not an issue with any particular task, because the full set runs fine sequentially. It also runs fine in parallel if I split the dataset in half and do two separate parallel runs.

  • The best explanation I can come up with is that too much data is being transferred over the sockets, but there doesn't seem to be a cluster option to throttle data limits.

I'm at a loss on how to proceed. Has anyone seen this issue before or can suggest a fix?

Here's the code I'm using to set up the cluster:

library(snow)
library(doSNOW)
cluster = makeCluster( degreeOfParallelism , type = "SOCK" , outfile = "" )
registerDoSNOW( cluster )
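
The tasks themselves are dispatched with a foreach loop roughly like the following (a simplified sketch; runTask and taskList are stand-ins for my actual function and data, not the real code):

results = foreach( task = taskList ) %dopar% {
  runTask( task )
}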

Edit
While this issue is consistent with the entire dataset, it also appears from time to time with a reduced dataset. That might suggest that this isn't simply a data-limit issue.

Edit 2
I dug a little deeper, and it turns out that my function has a random component that sometimes causes a task to raise an error. If I run the tasks serially, then at the end of the operation I'm told which task failed. If I run in parallel, I get the "unserialize" error instead. I tried wrapping the code that gets executed by each task in a tryCatch call with error = function(e) { stop(e) }, but that also generates the "unserialize" error. I'm confused because I thought snow handled errors by passing them back to the master.
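
For reference, the wrapping looked roughly like this (simplified; runTask and taskList are the same stand-ins as above):

results = foreach( task = taskList ) %dopar% {
  tryCatch(
    runTask( task ) ,
    error = function(e) { stop(e) }  # re-throws on the worker; the master still reports the "unserialize" error
  )
}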

Suraj
  • R is limited to 128 simultaneous open connections... maybe that's part of it? – Joshua Ulrich Oct 14 '11 at 20:52
  • I am testing with 8 connections. – Suraj Oct 14 '11 at 21:09
  • But your question says everything works fine with 70 tasks, so I'm confused. – Joshua Ulrich Oct 15 '11 at 22:57
  • I think you're confusing tasks with connections. I have up to 8 connections processing many more tasks. In this case I have 135 tasks that I want to run in parallel, but only 8 cores on the CPU on which to process them (in practice I never go above 7 - I like to leave one available for the OS). – Suraj Oct 16 '11 at 14:02
  • Yes, I'm confused because the packages you're using don't use "tasks" to describe anything they do and you don't provide an example of what you mean by "tasks", so I'm trying to figure out what you mean. A minimal example that produces the behavior you describe would go a long way toward someone helping. As it stands, you require someone to replicate the behavior before they can even start investigating the cause. This may be why the author of snow ignored your email. – Joshua Ulrich Oct 16 '11 at 17:09
  • Take a look at the foreach vignette and reference manual; both refer to "tasks". I'm also fairly certain that the console output when you execute in parallel uses "task" to describe each iteration of a for loop that runs in parallel. I should work up a small example, but I'm certain that's not why my email was ignored; I just think Luke is busy. At the end of the email I actually asked whether I had provided enough information for us to dig deeper, which would have prompted a "no" answer. – Suraj Oct 16 '11 at 19:50

1 Answer


I have reported this issue to the author of SNOW but unfortunately there has been no reply.

Edit
I haven't seen this issue in a while. I moved to parallel/doParallel, and I'm now using try() to wrap any code that gets executed in parallel. I can't reproduce the original issue anymore.
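
A rough sketch of the setup I'm using now (degreeOfParallelism, runTask, and taskList are placeholders for my actual values, not the real code):

library(doParallel)  # loads foreach and parallel

cluster = makeCluster( degreeOfParallelism )
registerDoParallel( cluster )

results = foreach( task = taskList ) %dopar% {
  try( runTask( task ) )  # an error comes back as a "try-error" object instead of being thrown on the worker
}

# identify any tasks that failed
failed = sapply( results , inherits , "try-error" )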

Suraj
  • How are you positive this is a bug in snow and not foreach and/or doSNOW? – Joshua Ulrich Oct 16 '11 at 17:07
  • Not positive. I've unchecked my answer until I can come up with a working example. – Suraj Oct 16 '11 at 19:51
  • It feels like multiple worker nodes are trying to read from the same file on the hard drive. I had several models running in parallel that wrote to the same file, which led to all kinds of random errors. However, you don't provide much detail on what the function actually does or whether it accesses the hard drive, so we are left guessing to some extent. – Paul Hiemstra Nov 30 '11 at 13:39
  • This is a good guess... I'm not writing output to file, so this isn't the issue (actually, I wanted to pump output to the parent/master R screen but could not get that to work). It seems to happen when a child process throws an error, but there is supposed to be an error-handling facility, and it seems this error handler does not always work. I haven't had the time to reproduce it and am just hoping the issue goes away using the parallel package. – Suraj Nov 30 '11 at 13:43
  • I am having the same problem, any leads to solving it? Like SFun28, my function has quite a few random components to it. – user1234440 Nov 19 '13 at 07:37