
I'm writing a socket.io based server in Node.js (6.9.0). I am using the built-in cluster module to enable multiple processes. For now, there are only two processes: a master and a worker. The master receives the connections and maintains an in-memory global data structure (which the worker can query via IPC). The worker process does the majority of the work by handling each incoming connection.
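
Roughly, the structure looks like this (a simplified sketch, not my actual code; the message shapes, the `Map`, and the port are placeholders):

```js
'use strict';
const cluster = require('cluster');

if (cluster.isMaster) {
  const globalData = new Map(); // master's in-memory global structure

  const worker = cluster.fork();
  worker.on('message', (msg) => {
    if (msg.type === 'query') {
      // answer the worker's IPC query against the global structure
      worker.send({ type: 'result', id: msg.id, value: globalData.get(msg.key) });
    }
  });
} else {
  // worker: create the server; cluster forwards listen() to the master,
  // which then routes each accepted connection back to a worker
  const io = require('socket.io')(8080);

  io.on('connection', (socket) => {
    // query the master's global data via IPC
    process.send({ type: 'query', id: socket.id, key: 'someKey' });
  });

  process.on('message', (msg) => {
    if (msg.type === 'result') {
      // use msg.value to service the socket identified by msg.id
    }
  });
}
```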

When the server is stressed at 300 concurrent users, it hangs, and I cannot attribute the hang to any internal failure. Under lower concurrency, I don't see the hanging condition.

I've enabled all forms of debugging (using the debug module: socket.io:socket and socket.io:client, as well as my own custom calls to debug).
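
For reference, this is how that output is wired up (the `myapp:worker` namespace is made up; socket.io's own namespaces are enabled the same way, via the `DEBUG` environment variable):

```js
// Run with: DEBUG=socket.io:socket,socket.io:client,myapp:* node server.js
const debug = require('debug')('myapp:worker'); // hypothetical namespace

debug('worker %d handling connections', process.pid); // prints only when enabled
```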

The last activity I can see is in socket.io; however, the messages indicate that sockets are closing (reason "client namespace disconnect") due to their own "end of test" cycle. It just seems like incoming connections are not being serviced.

I'm using Artillery.io as the test client.

In the server application, I have handlers for uncaught exceptions and try-catch blocks around everything.
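
Concretely, the global handler looks something like this (simplified sketch):

```js
// Last-resort handler: log the error so a crash can't pass silently.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception:', err.stack);
});
```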

In a prior iteration, I also used cluster, but reversed the responsibilities so that the master process handled the connections (with the worker handling the global data). That iteration didn't exhibit the same failure. I'm not sure whether something is wrong with the connection distribution. To investigate, I have also dumped internalMessage events to monitor the internal workings of cluster.

I am not using any other module for connection distribution or sticky sessions. Since there is only a single process handling connections (at this time), sticky sessions don't seem relevant.

gboysko
    How are you passing the connections from master to worker? – robertklep Apr 03 '17 at 17:57
  • I'm using the built-in mechanism that `cluster` provides (as I understand it). In essence, I'm doing nothing explicitly: the worker creates the server, initializes `socket.io` and then simply listens on a specific port. `cluster` redirects that worker's `listen` call to the master and routes (via "round robin") each new connection to a worker. – gboysko Apr 03 '17 at 18:50
  • You could try the other method `cluster` provides (see [this](https://nodejs.org/api/cluster.html#cluster_cluster_schedulingpolicy), specifically `cluster.SCHED_NONE`), but it might also be worthwhile ruling other things out, like temporarily disabling the worker querying that global data structure that the master holds. I assume that having only one worker is temporary (you'll scale up to multiple workers once this issue is resolved)? – robertklep Apr 03 '17 at 19:08
  • Changing the connection distribution algorithm is definitely a good idea. As for the other suggestion, I am not able to disable the worker from querying the global state--it relies on it to function correctly. And yes, one worker is only temporary--I wanted to demonstrate correct behavior on 1 worker, before creating one for each core. – gboysko Apr 03 '17 at 19:32
  • `cluster`'s handling of connections is fine for something like an HTTP server where each connection is independent. This is not the case for socket.io, as it needs to maintain state across connections (when using long polling instead of WebSockets, or to handle disconnects, etc.). So here, the master ends up distributing new connections randomly instead of to the same worker which handled previous connections for the same client. You may want to read https://socket.io/docs/using-multiple-nodes/#using-node.js-cluster and https://github.com/elad/node-cluster-socket.io – jcaron Apr 03 '17 at 21:48
  • See also http://stackoverflow.com/a/18650183/3527940 and the many more results from Google for "node cluster socket.io" – jcaron Apr 03 '17 at 21:49
  • Thanks, jcaron. I am aware of the limitations of cluster for a socket.io solution. As noted in my description, I am using a single worker process (for now) before proceeding to multiple worker processes. My goal is not to proceed to the next step until the behavior is well understood. I've read all of the socket.io docs. FWIW, it seems that `cluster.SCHED_NONE` is behaving much better than `cluster.SCHED_RR` and seems to remove the hanging condition. I may need to implement a custom connection distribution algorithm if I don't go with `sticky-session` or some other NPM module. – gboysko Apr 03 '17 at 22:09
  • To use socket.io with clustering, you have to either use sticky load balancing (so a repeat connection from a given client always goes to the same server process) or you have to change the socket.io client default so that it immediately connects with webSocket and doesn't use a couple of HTTP polling requests to start each connection (a sketch of that client option follows these comments). Without sticky load balancing to the same cluster server, socket.io reconnects (if the connection is temporarily interrupted) may lose state because they may go to a new server process upon reconnect. – jfriend00 Apr 03 '17 at 23:49
  • Thanks, jfriend00. I understand all of this. In my example, I'm using a single worker process. All requests start and end at the same process. – gboysko Apr 04 '17 at 15:28
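
A minimal sketch of the websocket-only client option jfriend00 describes above (the URL is a placeholder):

```js
// Connect over WebSocket only, skipping the initial HTTP long-polling
// requests, so a session can't be split across workers by a
// non-sticky balancer.
const socket = require('socket.io-client')('http://localhost:8080', {
  transports: ['websocket']
});
```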

1 Answer


I was able to remove the hanging condition by changing the cluster scheduling policy from round-robin (`cluster.SCHED_RR`) to none (`cluster.SCHED_NONE`), which leaves connection distribution to the operating system. I can't tell whether this is due to a bug in the round-robin connection distribution (or something else inherent in the scheduling policy), but this one change seems to prevent the hanging condition.
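
The change itself is one line, set before forking (a simplified sketch; the policy can also be selected with the `NODE_CLUSTER_SCHED_POLICY=none` environment variable):

```js
const cluster = require('cluster');

// Leave connection distribution to the operating system instead of the
// master's round-robin. Must be set before cluster.fork() is called.
cluster.schedulingPolicy = cluster.SCHED_NONE;
```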

gboysko