
We're developing an online game where players communicate with the server over a persistent TCP connection. Persistent meaning that its lifetime is that of a player's session: if the connection is closed, the player is thrown out of the game (though the client will attempt to reconnect automatically).

Problem

Now, of course everything works fine in our office (connecting to both testing and live servers), but our client reports that some players get disconnected a lot (every few seconds), and that they experience it themselves too (though their offices are in the same building).

Question

How can I find out the cause of these disconnects? Is it because:

  • Players have bad internet connections and it can't be helped.
  • The distance between players and server (Turkey <-> Netherlands) is too long.
  • Something is wrong with the server (a CentOS machine) or the datacenter.
  • The server is overloaded (though it happens under low loads too).
  • There is an error in our software.
  • Or some other reason?

The software is written in Java. It logs when players are disconnected, and if it actively kicks them (e.g. for not sending keep-alive messages) it logs that too.

Known data

  • Whenever a spurious disconnect is reported and I check the logs, most of the time I don't see that player being actively kicked by the server software; I only see that the connection has been closed.
  • There is an internal monitoring service which has a bunch of localhost connections to the game server, the same way players do, and it doesn't get disconnected.

Others

There are many other online games like ours. How do they deal with this? (Unless the problem is in the server/datacenter, then the solution is obvious)

  • Do they use UDP? I know action games do, for speed, but I presume TCP is normal for e.g. online poker and other slow games? (Not that that would help us, our client software is made in Flash, which doesn't support UDP)
  • Is there some TCP tweaking that can be done to make it more lenient?
  • Or do they get these disconnects as well, and just reconnect more transparently?
  • Is there information about this on the web?
Bart van Heukelom
  • +1 and good luck; I _hate_ debugging networking. Did you know that sometimes clients can end up changing IP address every few minutes? Yes, things get that horrific. Go for resilient reconnects. – Donal Fellows Feb 09 '12 at 11:15
  • I do not think that determining the reason will be possible with only server-side logs. It may be possible to make the client log when it loses the connection and ask your clients to send you the log files (via a "report a problem" button or some other way). – ShaMan-H_Fel Feb 09 '12 at 12:31
  • That's right, there is no generic way to figure out from the server side what causes customer disconnections. What OS is your Java app running on, and how many simultaneous connections do you typically have? – mac Feb 10 '12 at 14:20
  • @JohnD OS is CentOS Linux 5.5, and we have about 400 connections – Bart van Heukelom Feb 10 '12 at 14:30
  • It could be a firewall issue. Have a look at http://stackoverflow.com/a/4541177/727201 – mac Feb 10 '12 at 15:06
  • Can you monitor the ICMP messages? That might help answer some of the questions. – Brad Semrad Feb 10 '12 at 15:40

1 Answer


I would ask players to allow you to enable "anonymous usage data", like many apps do, to periodically upload debugging information from their sessions back to you. This is how you figure out these sorts of situations.

From there, what you'll need when a disconnect happens is a fairly verbose log. Catch whatever exception was thrown, and don't forget to log its cause as well: keep calling .getCause() until you've logged all the way back to the root cause. Also log any data you need to match the client log up with the server-side logs, such as session IDs, game IDs, and timestamps. Just think, "What information would I need in order to troubleshoot this, assuming I had insight into both sides of the connection?" That is what you'll ultimately get by asking users to upload usage and debugging data.
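As a rough illustration of the cause-chain logging described above, here is a minimal Java sketch. The class name DisconnectLog, the method names, and the sessionId/gameId fields are all made up for the example; it only relies on the standard Throwable.getCause() call, and you would swap the string building for whatever logging framework you already use.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
class DisconnectLog {

    /** Walks getCause() back to the root and returns one line per throwable. */
    static List<String> causeChain(Throwable t) {
        List<String> lines = new ArrayList<String>();
        // Identity set guards against a (rare) cyclic cause chain.
        Set<Throwable> seen =
                Collections.newSetFromMap(new IdentityHashMap<Throwable, Boolean>());
        for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }

    /** Builds one log record; sessionId and gameId are assumed correlation fields. */
    static String describe(String sessionId, String gameId, long timestampMillis, Throwable t) {
        StringBuilder sb = new StringBuilder();
        sb.append("disconnect session=").append(sessionId)
          .append(" game=").append(gameId)
          .append(" ts=").append(timestampMillis);
        List<String> chain = causeChain(t);
        for (int i = 0; i < chain.size(); i++) {
            sb.append(System.lineSeparator())
              .append("  cause[").append(i).append("] ").append(chain.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated failure: an IOException wrapping a lower-level socket error.
        Throwable root = new SocketException("Connection reset");
        Throwable wrapped = new IOException("read failed", root);
        System.out.println(describe("s-123", "g-7", System.currentTimeMillis(), wrapped));
    }
}
```

Logging the same session ID and a timestamp on both sides is what lets you line up a client record with the matching server record later.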

From there you should be able to identify at least a few situations that you have control over, that is, where you can change your client/server code to alleviate some of the problems. In other cases, where the problem is a client's configuration or faulty equipment (or a piece of equipment in between that neither of you controls), you'll have to rely on robust re-connectivity.
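One common way to make reconnects robust without hammering the server is exponential backoff with random jitter. The sketch below is only an illustration of that policy, written in Java for consistency with the server side; the real client in the question is Flash, and the class name ReconnectBackoff and the base/max delay values are assumptions, not anything from the original setup.

```java
import java.util.Random;

// Hypothetical reconnect policy: exponential backoff with full jitter.
class ReconnectBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private final Random random;
    private int attempt = 0;

    ReconnectBackoff(long baseMillis, long maxMillis, Random random) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
        this.random = random;
    }

    /** Upper bound for an attempt: base doubled once per attempt, capped at maxMillis. */
    static long ceiling(long baseMillis, long maxMillis, int attempt) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Returns a random delay in [0, ceiling] and advances the attempt counter. */
    long nextDelayMillis() {
        long cap = ceiling(baseMillis, maxMillis, attempt);
        if (attempt < Integer.MAX_VALUE) {
            attempt++;
        }
        return (long) (random.nextDouble() * (cap + 1));
    }

    /** Call after a successful reconnect so the next failure starts small again. */
    void reset() {
        attempt = 0;
    }

    public static void main(String[] args) {
        // Assumed values: 250 ms base delay, 10 s maximum.
        ReconnectBackoff backoff = new ReconnectBackoff(250, 10_000, new Random());
        for (int i = 0; i < 8; i++) {
            System.out.println("attempt " + i + ": wait " + backoff.nextDelayMillis() + " ms");
        }
    }
}
```

The jitter matters when many players drop at once (for example after a server restart): without it, they would all retry at the same instants.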

You'll never reduce disconnects to zero, but once you have seen enough cases, this information should help you narrow the remaining disconnects down to situations that are outside your control. At that point your power to shape the network ends, and you'll be as close to a "best case scenario" for network reliability as you can be.

jefflunt