How to get zero downtime with Socket.io / Node.js server?

Question

I have a Node.js web server running with Socket.io. I found that if a single error happens in the script, the entire server crashes, so I'm trying to find a way to keep the server up and running in cases like this once the app goes into production. I found one answer that seemed promising, but it didn't solve my particular problem when I tried implementing it in my code: "How do I prevent node.js from crashing? try-catch doesn't work"

EDIT:

What I've fixed so far: PM2 now auto-restarts the script after a crash, and Redis is set up with my user session data stored in it.

My code is currently set up like this:

EDIT #2: After studying and working on the code all day, I edited it slightly a second time to include "sticky-session" logic. After this edit, the strange socket connections every 1 second are gone, and it seems (though I'm not completely sure) that the sockets are all in sync across workers. When the script crashes, the app (not PM2) spawns a new process, which seems good. However, when a worker crashes, users still have to refresh the page to renew their session and get new sockets, which is a big problem...

var fs = require('fs'),
  https = require('https'),
  express = require('express'),
  options = {
    key: fs.readFileSync('/path/to/privkey.pem'),
    cert: fs.readFileSync('/path/to/fullchain.pem')
  },
  cluster = require('cluster'), // not really sure how to use this
  net = require('net'), // not really sure what to do here
  socketio = require('socket.io'),
  io_redis = require('socket.io-redis'), // not really sure how to use this
  sticky = require('sticky-session');

var app = express();
// Report which worker answered the request
app.get('/', function(req, res) {
  res.end('worker: ' + cluster.worker.id);
});
var server = https.createServer(options, app);

// sticky.listen() returns false in the master and true in a worker
if (!sticky.listen(server, 3000)) {
  // Master code
  // No cluster.fork() loop here: sticky-session forks the workers itself
  server.once('listening', function() {
    console.log('server started on port 3000');
  });
}
else {
  // Worker code
  var
    io = socketio(server),
    getUser = require('./lib/getUser'),
    loginUser = require('./lib/loginUser'),
    authenticateUser = require('./lib/authenticateUser'),
    client = require('./lib/redis'); // connect to redis

  // Share socket.io events between workers through Redis
  io.adapter(io_redis({host: 'localhost', port: 6379}));

  client.on("error", function(err) {
    console.log("Error "+err);
  });

  io.on('connection', function(socket){
    // LOTS OF SOCKET EVENTS / REDIS USER SESSION MANAGEMENT / APP
  });

}

I tried using "cluster", but I'm not sure how to get it working properly, since it involves multiple "workers", and I believe the sockets get mixed up between them. I'm not even sure which parts of my code ("require" calls, etc.) go in which "cluster" code block (Master/Worker), or how to keep the sockets in sync. Something just isn't right.

I'm assuming I need to use the npm packages socket.io-redis and/or sticky-session to keep the sockets in sync? (I'm not sure how to implement this.) Unfortunately, there just aren't any good examples on the internet or in the books I'm reading for clustering socket.io with node.js.

Can someone provide a basic code example on which parts of my code go where, or how to implement things? I would greatly appreciate it. The goals are:

1) If the server (node cluster process) crashes, the sockets should still work after restart (or another worker spawns).

For example, if two users (two sockets) are having a private message conversation and then a crash happens, the messages should still be delivered after PM2 auto-restarts (spawns a new cluster process) after crash. The problem I have: If the server crashes, messages stop getting sent to users even after an auto-restart.

2) Sockets should all be in sync together with different cluster processes.
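One way to approach goal 1 is to address messages to a user id rather than to a socket id, and to hold messages for a user who has no live socket at the moment. The sketch below is not socket.io code: `PendingMessages`, `attach`, `detach`, `send`, and the 'private message' event name are hypothetical, and a plain Map stands in for Redis, which would be needed for queued messages to survive a process crash.

```javascript
// Hypothetical helper: queues messages per user id while that user has no
// live socket, and flushes them when a new socket is attached.
// A Map stands in for Redis here, so this sketch alone does NOT survive a crash.
class PendingMessages {
  constructor() {
    this.sockets = new Map(); // userId -> object with an emit(event, payload) method
    this.queues = new Map();  // userId -> array of undelivered messages
  }

  // Call when an authenticated user (re)connects.
  attach(userId, socket) {
    this.sockets.set(userId, socket);
    const queued = this.queues.get(userId) || [];
    this.queues.delete(userId);
    queued.forEach((msg) => socket.emit('private message', msg));
  }

  // Call when the user's socket disconnects.
  detach(userId) {
    this.sockets.delete(userId);
  }

  // Deliver now if the user is connected, otherwise queue for later.
  send(userId, msg) {
    const socket = this.sockets.get(userId);
    if (socket) {
      socket.emit('private message', msg);
      return true;
    }
    if (!this.queues.has(userId)) this.queues.set(userId, []);
    this.queues.get(userId).push(msg);
    return false;
  }
}
```

With several workers, the `sockets` map only knows about sockets in its own process. A room per user id combined with the Redis adapter is one common way around that, which this sketch does not show.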

There's no magic here. You handle errors anywhere they can happen. Using promises with all async operations makes it a lot easier to catch async exceptions. You may want to consider using something like redis for your user sessions, as that would allow them to persist across a node.js restart and would also solve issues related to clustering. — jfriend00, Jan 24 '18 at 06:58
Thanks for the redis suggestion, I've installed it and have been studying it for the past few days. Unfortunately, I doubt I will be able to account for and handle every possible error. I've done a good job so far, but in the real world it's going to crash for strange and unusual reasons. Assuming I get redis working, I'm hoping there is still a way to implement a Cluster/Domain solution to prevent downtime. — peppy, Jan 27 '18 at 01:00
Use clustering with redis and with an auto-restart monitoring tool like forever and there should be no downtime when one server process gets restarted. You are correct that you cannot possibly account for all possible errors and no domain type solution can make sure your server can continue properly after some error either. So, you have to plan for a restart without downtime which cluster + redis + auto-restart tool can give you. — jfriend00, Jan 27 '18 at 01:39
I am pleasantly surprised at how nice and easy redis is. I was able to completely replace my global userSession variable with redis, and I have PM2 set up as an auto-restart, and it works great. I'm still having problems with cluster. I'm trying to figure out how to make multiple processes, but keep all the sockets in sync between them (I'm assuming something to do with socket.io-redis?). If a server does crash and restart, I would like to keep the sockets working. If 2 members are having a private chat and server crashes/restarts, I would like messages to keep going. Will edit post above. — peppy, Jan 31 '18 at 21:45
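The "cluster + auto-restart" idea from these comments can be sketched with Node's built-in cluster module alone. This is a minimal illustration, independent of PM2, forever, and sticky-session: `superviseWorkers` is a hypothetical name, and the `clusterApi` parameter exists only so the logic can be tried without spawning real processes.

```javascript
const cluster = require('cluster');

// Fork `count` workers and replace any worker that exits.
// `clusterApi` defaults to Node's cluster module; it is a parameter only so
// the logic can be exercised with a stand-in object.
function superviseWorkers(count, clusterApi = cluster) {
  for (let i = 0; i < count; i++) {
    clusterApi.fork();
  }
  clusterApi.on('exit', (worker, code, signal) => {
    console.log('worker ' + worker.id + ' exited (' + (signal || code) +
      '), forking a replacement');
    clusterApi.fork();
  });
}

// Typical use in the master process (commented out so that loading this
// file does not spawn processes):
// if (cluster.isMaster) superviseWorkers(require('os').cpus().length);
```

While a replacement worker starts, the remaining workers keep serving. Clients that were connected to the dead worker still lose their sockets and have to reconnect.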

score -1 · Answer 1 · answered Feb 01 '18 at 00:37

How to get zero downtime with …

You don't.

It's simply not possible with anything. You're asking the wrong questions. Try these:

How do I catch and handle errors I can predict?
How do I gracefully fail when there are errors I cannot predict?
How can I usefully separate errors in my application vs. errors in how clients interact with it?
How can I build a distributed system?
How do I deploy and scale a system with fault tolerance in mind?
I have [single point of failure XYZ], how do I distribute [XYZ] to remove it?
What systems monitoring is useful for [some technology]?
How do I set up automation for [recurring problem X]?

etc. etc.
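As an illustration of failing gracefully on errors that cannot be predicted, here is a minimal sketch: log the error, stop accepting new connections, then exit with a non-zero code so that a supervisor can start a fresh process. `makeFatalHandler` and its parameters are hypothetical names, and the five-second grace period is an arbitrary assumption.

```javascript
// Build a handler for errors nobody predicted: log, stop accepting new
// connections, then exit with a non-zero code so a supervisor (PM2, a
// cluster master, etc.) can start a fresh process.
// `server` needs a close(callback) method, as http/https servers have;
// `exit` defaults to process.exit and is a parameter only for testing.
function makeFatalHandler(server, exit = process.exit, graceMs = 5000) {
  let shuttingDown = false;
  return function onFatal(err) {
    if (shuttingDown) return; // ignore repeated errors during shutdown
    shuttingDown = true;
    console.error('fatal error, shutting down:', err);
    // Force the exit if open connections keep close() from finishing.
    const timer = setTimeout(() => exit(1), graceMs);
    if (timer.unref) timer.unref();
    server.close(() => {
      clearTimeout(timer);
      exit(1);
    });
  };
}

// Typical wiring (hypothetical `server` variable):
// const onFatal = makeFatalHandler(server);
// process.on('uncaughtException', onFatal);
// process.on('unhandledRejection', onFatal);
```

Exiting instead of continuing reflects the point made in the comments: after an unknown error the process state cannot be trusted, so the plan is a fast restart, not recovery in place.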

Brad


Thanks a lot for the tips, Brad, it's greatly appreciated. I think I understand this now. The last thing I need to fix is a socket-specific issue. I'm running a chat room / private message script. When the server crashes and restarts, I would like a near-seamless solution so users can keep their chat windows open and continue chatting with each other without being forced to refresh their browsers (keep the sockets connected after a server restart, or keep the sockets running if one node worker process goes down). Hopefully there is an easy solution to this. – peppy Feb 01 '18 at 02:43
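The socket.io client reconnects on its own by default, so a page refresh should not be needed just to get a new socket. What the new socket lacks is the server-side link to the user. A possible sketch, assuming the session lives in Redis and the client holds some token: re-identify on every 'connect' event. `keepSessionAlive`, `getToken`, and the 'authenticate' event name are hypothetical, not part of socket.io.

```javascript
// `socket` is assumed to behave like a socket.io client socket: it has
// on(event, handler) and emit(event, payload), and fires 'connect' both on
// the first connection and after each successful reconnection.
// The 'authenticate' event name and `getToken` are hypothetical.
function keepSessionAlive(socket, getToken) {
  socket.on('connect', () => {
    // Re-identify on every connect so a replacement worker can rebuild the
    // socket-to-user mapping from the session stored in Redis.
    socket.emit('authenticate', { token: getToken() });
  });
}
```

On the server side, the matching handler would look up the token in Redis and re-attach the user to the new socket. That part depends on the application's session code and is left out here.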
