
The code

I have a clustered Node application that listens to TCP traffic and parses binary data into JSON format.

But here's the catch: all incoming traffic comes across a single persistent connection.

As I understand it, cluster will balance load on a single port by distributing new sockets across workers, but there is no native way to distribute the load of a single socket across workers.
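For contrast, here is a minimal sketch of the pattern cluster does balance natively (the port number is arbitrary): one shared listening port, many connections, each new connection handed to some worker. A single long-lived socket, as in my case, stays pinned to whichever process accepted it.

var cluster = require('cluster');
var net = require('net');
var os = require('os');

if (cluster.isMaster) {
  // one worker per CPU
  for (var i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  net.createServer(function (socket) {
    // each distinct inbound connection lands in one worker and stays there
  }).listen(8124);
}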

To work around this, I've set up the cluster master to accept the incoming connection and segment the stream into messages, which it then explicitly passes to the workers in round-robin fashion. When the stream responsible for segmenting messages emits a new message, the master simply uses the cluster messaging API to send it to the next worker/parser in line:

// (cluster.isMaster === true)

var gateway = new Gateway(config.gateway.port, config.gateway.host);
var nextWorker = 1; // cluster assigns worker ids starting at 1

// Round-robin: each segmented message goes to the next worker in line.
gateway.on('message', function roundRobin (msg) {
  var workers = cluster.workers;
  var numWorkers = Object.keys(workers).length;
  workers[nextWorker].send(msg);

  // advance, wrapping back around to the first worker
  if (++nextWorker > numWorkers) {
    nextWorker = 1;
  }
});

// Each worker's response is forwarded back out through the gateway.
for (var w in cluster.workers) {
  cluster.workers[w].on('message', gateway.respond.bind(gateway));
}

The workers parse each message, use it to make an HTTP request, and then use the cluster messaging API to respond back to the gateway (via the worker 'message' handlers in the last block above).
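For context, the worker side looks roughly like the sketch below. parseBinary() and the upstream host/path are placeholders for my actual parsing and HTTP logic:

// (cluster.isWorker === true)
var http = require('http');

process.on('message', function (msg) {
  var parsed = parseBinary(msg); // placeholder: binary -> JSON

  var req = http.request({
    host: 'upstream.example.com', // placeholder upstream service
    path: '/ingest',
    method: 'POST'
  }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      // reply to the master, which forwards it via gateway.respond()
      process.send(body);
    });
  });

  req.end(JSON.stringify(parsed));
});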

The problem

I am getting strange and unpredictable latency patterns when placing the system under load. All CPU/memory/network measurements are sane and don't indicate an infrastructure bottleneck.

The question

As you can see, work is distributed among the workers equally, without regard to the actual throughput of any given worker. My hunch is that this is what's causing the latency spikes: somewhere, perhaps, an individual worker is getting backed up.

Is there any way to confirm this, either in principle or empirically? Perhaps it's just wishful thinking, but it seems like the approach should average out and not need a worker-pull type algorithm. (Which seems especially tricky, since I can't decide when a worker should count as free: after it's done parsing? After it's received an HTTP response? After it has sent its response back to the gateway?)
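One way I could imagine confirming it empirically: have the master count in-flight messages per worker (incremented on dispatch, decremented on response, assuming exactly one response per message) and log the distribution. A rough sketch, replacing the dispatch code above:

var inFlight = {};
var next = 1;

Object.keys(cluster.workers).forEach(function (id) {
  inFlight[id] = 0;
  cluster.workers[id].on('message', function (msg) {
    inFlight[id]--;  // response received: one fewer outstanding
    gateway.respond(msg);
  });
});

gateway.on('message', function (msg) {
  inFlight[next]++;  // message dispatched: one more outstanding
  cluster.workers[next].send(msg);
  if (++next > Object.keys(cluster.workers).length) next = 1;
});

// If one worker's count keeps climbing while the others hover near
// zero, that worker is the backlog.
setInterval(function () {
  console.log('in-flight per worker:', JSON.stringify(inFlight));
}, 1000);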

I just don't know enough about CPU scheduling to tell whether I'm chasing a red herring or whether this really is a poor algorithm that's causing the trouble. (And if so, any ideas on how to improve it would be appreciated; one variant I've considered is sketched below.)
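For what it's worth, the least tricky improvement I can think of is "fewest outstanding messages": count a worker as busier on every dispatch and less busy on every response, so the "free" moment is when its response reaches the gateway. A sketch, reusing the same in-flight bookkeeping as above:

var outstanding = {};

Object.keys(cluster.workers).forEach(function (id) {
  outstanding[id] = 0;
  cluster.workers[id].on('message', function (msg) {
    outstanding[id]--;  // "free" once its response reaches the gateway
    gateway.respond(msg);
  });
});

gateway.on('message', function (msg) {
  // pick the worker with the fewest messages in flight
  var target = Object.keys(outstanding).reduce(function (a, b) {
    return outstanding[a] <= outstanding[b] ? a : b;
  });
  outstanding[target]++;
  cluster.workers[target].send(msg);
});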

  • Have you been looking at [this](https://strongloop.com/strongblog/whats-new-in-node-js-v0-12-cluster-round-robin-load-balancing/) and [this](https://github.com/joyent/node/commit/e72cd41)? and what node version are you on? – Madness Aug 10 '15 at 16:22
  • Because you are going to want to see [this](http://stackoverflow.com/questions/21845303/node-js-cpu-load-balancing) and subsequently [this](http://stackoverflow.com/questions/2387724/node-js-on-multi-core-machines/8685968#8685968) and [this](http://stackoverflow.com/questions/14795145/how-the-single-threaded-non-blocking-io-model-works-in-node-js/14797359#14797359) – Madness Aug 10 '15 at 16:28
  • Yes, thanks, @Madness. I've read both of those. As the SL post says, "The current algorithm for selecting the worker is not very sophisticated. As the name suggests, it’s round-robin – it just picks the next available worker." That's why I did it the way I did (though I'm not sure what "available" means, exactly). Using v0.12.6. – Luke Aug 10 '15 at 16:32
  • Regarding the subsequent posts – I will have to check those out. But, recall the special case I have which is that incoming traffic is a single persistent socket. Cluster will distribute multiple connections on a single _port_ to different workers, but it is unable to natively distribute the load of a single _socket_. – Luke Aug 10 '15 at 16:37
  • Yeah, that sounds like you will need to use the solutions that couple the use of nginx. It has been the goto fallback for all of these issues in Node – Madness Aug 10 '15 at 16:44
  • 1
    `worker.send()` is synchronous, so it may not be the best way to distribute data between your master and the workers. Have you tried putting a Redis queue in between? Master pushes messages to Redis, workers take a message off the queue when they're ready. – robertklep Aug 10 '15 at 16:53
