I am trying to use the parallel package and found that makeCluster never completes. I've traced the hang to the following line in newPSOCKnode:

con <- socketConnection("localhost", port = port, server = TRUE, 
    blocking = TRUE, open = "a+b", timeout = timeout)

That command stalls (granted, the default timeout is a large value). My suspicion is that this is due to some "overzealous IT rules" laid down on our work computers, but I would welcome any suggestions as to how to trace (and fix) the source of the problem. This is Windows 7 64-bit "Enterprise", R 3.0.1.

More info: inside a debugging session, I set timeout <- 10, but it still hangs -- as though socketConnection is getting trapped somewhere that it can't even check the timeout value.

Here's my dump at the same point as Richie Cotton's data:

Browse[3]> ls.str()
arg :  chr "parallel:::.slaveRSOCK()"
cmd :  chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\" -e \"parallel:::.slaveRSOCK()\" MASTER=localhost PORT=11017 OUT="| __truncated__
env :  chr "MASTER=localhost PORT=11017 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE"
machine :  chr "localhost"
manual :  logi FALSE
master :  chr "localhost"
methods :  logi TRUE
options : <environment: 0x000000000ccac6a0> 
outfile :  chr "/dev/null"
port :  int 11017
rank :  int 1
renice :  int NA
rscript :  chr "\"C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript\""
timeout :  num 2592000
useXDR :  logi TRUE

So aside from a different port number, I think everything matches up.

Next trick: I opened a shell and ran netsh advfirewall firewall add rule name="Open Port 11017" dir=in action=allow protocol=TCP localport=11017 and got an "OK" response. I ran netstat -a -n and found the following line:

TCP 0.0.0.0:11017 0.0.0.0:0 LISTENING

But running makePSOCKcluster still hangs at the same place.

NEXT: I tried running R from the command line (via cygwin bash), and the error message I get is "Error in loadhistory(file) : no history mechanism available. Execution halted", after which Ctrl-C returns me to the R prompt.

Carl Witthoft
  • Specifying the appropriate path to `Rscript.exe` in `makeCluster(..., rscript = )` prevented that function from hanging for me. That was a while ago, though. – BenBarnes Oct 08 '13 at 13:33
  • @BenBarnes tried that -- no go. I checked; the default path was being correctly generated anyway. – Carl Witthoft Oct 08 '13 at 14:05
  • The master hangs there because it is waiting for the worker that it just started to connect back to it. The real problem is almost certainly in the worker. Try using the outfile option to see if the worker threw an error or if it's also hanging because of a firewall. – Steve Weston Oct 08 '13 at 14:40
  • It would be helpful to see the call to `makeCluster` since that controls how the workers are started. – Steve Weston Oct 08 '13 at 15:07
  • @SteveWeston I tried that; no error (or at least, no file ever gets created) generated – Carl Witthoft Oct 08 '13 at 15:17
  • If the "outfile" never got created, that means that something went wrong in starting the worker very early on. I suggest that you try manual mode: `cl <- makePSOCKcluster(3, manual=TRUE, outfile="log.txt")`. That may be the only way to see the error. – Steve Weston Oct 08 '13 at 17:41
  • @SteveWeston nope, that hangs as well. – Carl Witthoft Oct 08 '13 at 17:45
  • So the worker that you manually started hangs without an error message? I would try debugging that worker to see where it is hanging also. – Steve Weston Oct 08 '13 at 17:46
  • @SteveWeston I plead naivete here: I don't know what a "worker" means. All I can tell you is that `makePSOCKcluster` does not finish and no outfile is created. I can hit the `STOP` button in the Rgui but there's nothing new in my environment. Just to be clear- I'm trying to create a cluster on my home machine's (i7) cores. – Carl Witthoft Oct 08 '13 at 19:05

3 Answers

What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the "worker" processes that perform the actual work of the "cluster". That involves starting new R sessions with the Rscript command. Each session executes the .slaveRSOCK function, which creates a socket connection back to the master and then runs the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong while starting any of the worker processes (and trust me: a lot can go wrong), the master will hang in socketConnection, waiting for a worker to connect even though that worker may have died or never been created at all.

For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.

Here's an example:

> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE 

At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.

To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.
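While stepping through the worker, it helps to know that the worker learns where the master is from the KEY=VALUE arguments on its command line (MASTER, PORT, OUT, and so on). The following is only a rough sketch of that idea, not the package's actual code; `parse_worker_args` is a hypothetical helper name introduced here for illustration:

```r
# Sketch only: split KEY=VALUE worker arguments into a named list.
# parse_worker_args() is a hypothetical helper, not part of the parallel package.
parse_worker_args <- function(args) {
  pos    <- regexpr("=", args, fixed = TRUE)  # position of the first "=" in each argument
  args   <- args[pos > 0]                     # drop anything that is not KEY=VALUE
  pos    <- pos[pos > 0]
  keys   <- substr(args, 1L, pos - 1L)
  values <- substr(args, pos + 1L, nchar(args))
  as.list(setNames(values, keys))
}

# In a session started with `R --vanilla --args MASTER=localhost PORT=10187 ...`
# you would call parse_worker_args(commandArgs(trailingOnly = TRUE)) instead.
settings <- parse_worker_args(c("MASTER=localhost", "PORT=10187", "OUT=log.txt"))
str(settings)
```

Comparing the parsed MASTER and PORT values against what the master printed is a quick way to rule out a copy-and-paste mistake before digging into the socket code.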

For more information on this kind of problem, see my answer to a similar question.

UPDATE

Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:

  • Check to see if anything in your .Rprofile only works in interactive mode
  • On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
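As a small illustration of the outfile tip, here is a minimal sketch of a wrapper that always captures worker output in a file. The function name `make_logged_cluster` and the default file name `worker-log.txt` are assumptions made up for this example; only `makePSOCKcluster` and its `outfile` argument come from the parallel package:

```r
library(parallel)

# Hypothetical wrapper: start local workers with their output captured in a file.
# Passing log = "" instead leaves the workers' output unredirected, which is
# where running under Rterm rather than Rgui helps you actually see it.
make_logged_cluster <- function(n = 1L, log = "worker-log.txt") {
  makePSOCKcluster(n, outfile = log)
}

# Typical use (not run here, because it hangs if the workers cannot connect):
# cl <- make_logged_cluster(2L)
# parSapply(cl, 1:5, sqrt)
# stopCluster(cl)
# readLines("worker-log.txt")
```

If the log file is never created, the worker most likely failed before it got as far as redirecting its output, which is the case where manual mode is the better tool.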
Steve Weston
  • I get the following: ` 'C:/Users/carl.witthoft/Documents/R/R-3.0.1/bin/x64/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11017 OUT=mylog.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE Error in .External2(C_loadhistory, file) : 'loadhistory' can only be used in Rgui and Rterm Calls: loadhistory Execution halted` . I'm hoping this means I need to modify my .Rprofile to avoid that loadhistory call. I'll chase that down. – Carl Witthoft Oct 09 '13 at 11:46
  • OK, removing `loadhistory` from `.Rprofile` avoids that halt. However, neither the bash terminal nor the Rgui returns. I appreciate your help; if I ever solve this I'll post the answer. – Carl Witthoft Oct 09 '13 at 11:53
  • @CarlWitthoft You found and removed one problem that had to be fixed in order to create PSOCK clusters, but I can't say I'm surprised that you ran into another. This is an error prone business. Hopefully this information will help somebody sometime. – Steve Weston Oct 09 '13 at 13:01

Test 1: Does the obvious command work?

library(parallel)    
cluster <- makePSOCKcluster("localhost")    
parSapply(cluster, 1:5, sqrt)
stopCluster(cluster)

Test 2: Is your port blocked?

According to ?makeCluster, the default port is 10187 (at least in the R 2.15.2 installation I'm using; check the port reported in your own session). Ask your network admin whether that port is open.

Test 3: Do the variables passed in to socketConnection look right?

If I do debugonce(parallel:::newPSOCKnode) and then step through to just before the call to socketConnection, the workspace looks like this:

ls.str()
## arg :  chr "parallel:::.slaveRSOCK()"
## cmd :  chr "\"C:/PROGRA~1/R/R-215~1.2/bin/x64/Rscript\" -e \"parallel:::.slaveRSOCK()\" MASTER=localhost PORT=10187 OUT=/dev/null TIMEOUT=2"| __truncated__
## env :  chr "MASTER=localhost PORT=10187 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE"
## machine :  chr "localhost"
## manual :  logi FALSE
## master :  chr "localhost"
## methods :  logi TRUE
## options : <environment: 0x0000000010bf2518> 
## outfile :  chr "/dev/null"
## port :  num 10187
## rank :  int 1
## renice :  int NA
## rscript :  chr "\"C:/PROGRA~1/R/R-215~1.2/bin/x64/Rscript\""
## timeout :  num 2592000
## useXDR :  logi TRUE

Are you getting the same things passed in?

Richie Cotton
  • Oddly, trying to `debug(socketConnection)` makes R hang for me, as does running the code snippet in the question straight from the global environment. – Richie Cotton Oct 08 '13 at 14:57
  • See my `ls.str` dump above. BTW, isn't there some command (either in cmd.exe or from a cygwin bash window) I can use to see what my port statuses are? – Carl Witthoft Oct 08 '13 at 15:11
  • @CarlWitthoft: Take a look here for finding blocked ports. http://serverfault.com/questions/26564/how-to-check-if-a-port-is-blocked-on-windows – Richie Cotton Oct 08 '13 at 16:18
  • I added my foray into `netstat` and `netsh` to the question. – Carl Witthoft Oct 08 '13 at 16:55

Well, don't I feel like a complete idiot.

I went back to "The Three R's of Software Debugging" (Retry, Reboot, Reload), and after rebooting my system and successfully doing the manual worker startups, I tried creating a cluster with manual=FALSE and had immediate success there as well.

EDIT: I should make it clear that changing my .Rprofile from loadhistory() to if(interactive() ) loadhistory() was critical to successful use of the cluster functions.
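For anyone hitting the same error, the guarded version of that `.Rprofile` line might look like the sketch below. It assumes the history call is the only interactive-only statement in the profile; the `utils::` prefix and the `try()` wrapper are precautions added here, not part of my original file:

```r
# .Rprofile fragment (sketch): only touch the history mechanism in interactive sessions.
# Workers started via Rscript are non-interactive, so they skip this block entirely.
if (interactive()) {
  # loadhistory() can only be used in Rgui and Rterm; try() keeps a missing
  # history file from aborting startup.
  try(utils::loadhistory(), silent = TRUE)
}
```

The same guard applies to anything else in the profile that assumes a console, since every worker sources the profile when it starts.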

I'm very grateful to Richie and Steve for all their helpful comments and suggestions. I've certainly learned a bunch of stuff "under the hood," so the experience was quite positive at least for me.

(So I've no idea what WindowsOS thingie or broken call had been getting in the way, but all's well that ends well)

Carl Witthoft
  • Did you add `loadhistory` back to your `.Rprofile`? Was that necessary to fix the problem or not? – Steve Weston Oct 09 '13 at 13:09
  • I really don't think this is the answer. I think the problem was loadhistory. – Steve Weston Oct 09 '13 at 14:20
  • @SteveWeston while it's true that `loadhistory` was **a** problem, that wouldn't explain the fix after restarting the OS. I had shut down `R` and restarted `R` with the modified `.Rprofile` and still couldn't run `cluster` functions. So there were two problems, one of which was `loadhistory` and the other remains a mystery. – Carl Witthoft Oct 09 '13 at 14:35
  • It's impossible to prove that rebooting fixed a problem, and even if it did, it's an answer that doesn't help anyone. I think that this answer obscures an important failure mode that you've uncovered with your experiments. You shouldn't feel like an idiot at all. You've uncovered something that could help other R Windows users that have trouble executing makePSOCKcluster. – Steve Weston Oct 09 '13 at 14:45