Can a lot of TIME_WAIT bring down a server?

Question

I have read the related question:

What is the cost of many TIME_WAIT on the server side?

But I'm still lost. We have two application servers and a database server (all are virtual machines provided by a cloud service). Today the database server just shut down completely without any warning. We managed to get the cloud service vendor to get it back up online and we restored our application to a working state again.

When questioned about the reason for this, the cloud service vendor returned with a bunch of TCP statistics (around 1500 lines) that look like this (masked for privacy):

ipv4     2 tcp      6 98 TIME_WAIT src=x.x.x.x dst=y.y.y.y sport=z dport=5432 packets=p bytes=b src=y.y.y.y dst=x.x.x.x sport=5432 dport=z packets=p bytes=b [ASSURED] mark=0 secmark=0 use=2

The vendor claims that the server had issues and shut itself down because of too many incoming connections, as evidenced by the high number of TIME_WAIT connections.

However, there was no indication of the time frame over which the statistics were gathered. If they were collected over a long time range, they can't be used to claim that a large number of such connections existed at once.

Such a claim is only valid for a snapshot taken at a single point in time, showing that a large number of connections were in the TIME_WAIT state at that moment. Am I right?

Even if we grant that there were indeed a large number of TIME_WAIT connections at one point in time, is this damaging to the server, and can it bring the server to a grinding halt? Is this how a Denial of Service attack happens?

I'm not the network guy. Our network guy says the vendor is trying to push the blame on us by claiming we overloaded the server. Do the statistics show that this happened? — ADTC, Mar 12 '14 at 12:16
If you've already read my answer, please read again. I've added an edit that is an important caveat. — San Jacinto, Mar 12 '14 at 16:52
No pressure to accept the answer (I don't care much about the SO game anymore)... just didn't want you gearing up to combat your vendor without that knowledge. — San Jacinto, Mar 12 '14 at 19:44

San Jacinto · Answer 1 · 2014-03-12T16:51:29.847

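Each TIME_WAIT state must be tracked, plain and simple. This state maintenance (think: physical memory used by each connection) is what permits a TCP stack to associate an incoming packet with a connection that has already been closed. A retransmitted FIN is acknowledged again, and stale data segments are discarded. If the packet is a new SYN, then some (most?) implementations let it open a fresh connection straight out of TIME_WAIT, provided its sequence number is higher than the last one seen on the old connection. A stray RST or old segment that cuts the state short is what RFC 1337 calls TIME_WAIT assassination.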

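So simply: yes, it's possible to overload a system with too many recently closed connections, because each one lingers in TIME_WAIT for twice the maximum segment lifetime, which is typically between one and four minutes depending on the implementation.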

Regarding the likelihood of such an attack, yes it's certainly possible. However, it would likely have to be a distributed denial of service (DDoS), not a normal DoS. This is because to put the connection in TIME_WAIT, the connection would have to fully open (SYN + SYN/ACK + ACK) and then close (FIN + FIN/ACK + ACK), and just a handful of machines isn't going to be capable of flooding the server in such a way. But given that opening a TCP connection takes milliseconds and TIME_WAIT typically lasts for minutes, there is a potential problem.
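To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It uses Little's law: the average number of entries in TIME_WAIT equals the close rate multiplied by how long each entry lingers. The two durations are assumptions about the vendor's setup, not something the dump tells us: 60 seconds is the fixed TIME_WAIT length in the Linux TCP stack, and 120 seconds is the default netfilter conntrack timeout for TIME_WAIT entries. The function names are made up for illustration.

```python
def steady_state_time_wait(closes_per_second: float, time_wait_seconds: float) -> float:
    """Average number of entries in TIME_WAIT at any instant (Little's law)."""
    return closes_per_second * time_wait_seconds


def implied_close_rate(entries: int, time_wait_seconds: float) -> float:
    """Close rate needed to sustain `entries` TIME_WAIT entries at steady state."""
    return entries / time_wait_seconds


# Assumed durations (not taken from the vendor's dump).
ASSUMED_DURATIONS = {
    "Linux TCP stack": 60,        # seconds
    "conntrack default": 120,     # seconds
}

for label, seconds in ASSUMED_DURATIONS.items():
    rate = implied_close_rate(1500, seconds)
    print(f"{label}: {rate:.1f} closes/second sustain 1500 entries")
```

Under these assumptions, 1500 entries correspond to roughly 12.5 to 25 connections closed per second, which is a modest rate for two application servers talking to a database.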

However, much of this leads back to your vendor's TCP implementation. 1500 TIME_WAIT entries is not a lot, so this seems unrelated to the outage. If the server is dropping connections due to too many concurrent connections, then you need to get an idea of the active load at that time, not TIME_WAIT. Modern TCP implementations (server-end) can also avoid allocating connection state until the client's final ACK of the handshake is seen (TCP SYN cookies, used to blunt a SYN-flood DoS). So, there's some missing info here.
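To get an idea of the active load from a dump like the vendor's, a small tally by state is enough. This is a minimal sketch that assumes the lines follow the layout shown in the question, which looks like /proc/net/nf_conntrack output: layer 3 protocol name and number, layer 4 protocol name and number, seconds remaining before the entry expires, then the TCP state. The sample lines and the function name are invented for illustration, and the addresses come from the documentation range.

```python
from collections import Counter


def count_states(lines):
    """Tally TCP connection states from conntrack-style lines."""
    states = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: l3proto, l3num, l4proto, l4num, timeout, state, key=value...
        if len(fields) < 6 or fields[2] != "tcp":
            continue  # non-TCP entries carry no state field in this position
        states[fields[5]] += 1
    return states


# Made-up sample lines in the same shape as the vendor's output.
sample = [
    "ipv4     2 tcp      6 98 TIME_WAIT src=192.0.2.10 dst=192.0.2.20 sport=40001 dport=5432",
    "ipv4     2 tcp      6 12 TIME_WAIT src=192.0.2.11 dst=192.0.2.20 sport=40002 dport=5432",
    "ipv4     2 tcp      6 431999 ESTABLISHED src=192.0.2.10 dst=192.0.2.20 sport=40003 dport=5432",
    "ipv4     2 udp      17 29 src=192.0.2.10 dst=192.0.2.53 sport=5353 dport=53",
]

for state, count in count_states(sample).most_common():
    print(state, count)
```

The ESTABLISHED count is the figure to compare against the database's connection limit. If the fifth field really is a remaining timeout, the dump is a point-in-time snapshot of the tracking table rather than a log accumulated over a long period.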

Edit:

Though thinking more about this, the lack of a TCP-level problem wouldn't necessarily mean that your vendor is deflecting blame. 1500 TCP connections is very low, but for this specific database, perhaps it is not. Some RDBMSs only permit a relatively low number of connections (relative to what the TCP stack can support). This value is entirely RDBMS-dependent and can usually be configured.

For instance, I once exceeded the number of permissible concurrent connections to a MySQL server and the server refused to process any more data (you could call it a grinding halt) because I was not properly closing my connections to MySQL. It may be that your database is well able to support more than you're throwing at it, but you're inefficiently using the connections.
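As an illustration of closing connections properly, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in so the example runs anywhere; the helper name fetch_one is made up, and your real driver for the database on port 5432 will have its own connect call. The point is the pattern: wrap the connection so that close() runs even when the query raises.

```python
import sqlite3
from contextlib import closing


def fetch_one(database: str, query: str):
    """Run one query and always release the connection afterwards."""
    # closing() calls conn.close() on exit, even if the query raises.
    # Note: using a sqlite3 connection directly in a `with` statement only
    # commits or rolls back the transaction; it does not close the connection.
    with closing(sqlite3.connect(database)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchone()


print(fetch_one(":memory:", "SELECT 1 + 1"))
```

If the application opens connections at a high rate, a connection pool with a fixed upper bound is the usual next step, since it caps both the concurrent connections and the churn that produces TIME_WAIT entries.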
