14

The documentation for the various client/target/elb reset count metrics (TCP_Client_Reset_Count, TCP_Target_Reset_Count, TCP_ELB_Reset_Count) just says they count RST packets. I tried to understand what a RST packet is, and it seems to have to do with broken TCP connections. My load balancer has a single, long-term, seemingly successful client connection. Why do I see on the order of 100 client resets per hour? I also see about 10 load balancer resets per hour, and 0 target resets.

EDIT: I just observed that increasing the size of the server instance (I'm using Fargate and went from 0.25 vCPU to 0.5) led to a 10-fold reduction in client resets per hour. The number of load balancer resets did not change.

Aleksandr Dubinsky

3 Answers

5

My hunch is that this is related to a bug in the Network Load Balancer that causes it to send 100x as many health checks as it should. See: "NLB Target Group health checks are out of control". My theory is that a bug causes the health check connection to be broken in an unclean way if the target instance is not quick enough to respond. These broken health check connections get reported as "client resets" even though they should be reported as "ELB resets" or not reported at all.

Aleksandr Dubinsky
4

There are many reasons for a TCP RST to be sent. Some are abnormal and indicate errors, and some are normal connection cleanups that the TCP/IP stack or application performs.

An example of a normal TCP RST would be a long-lived connection that exceeds some time limit imposed by one side or the other. Once the time limit is exceeded, the connection can be forcibly closed, which generates the RST.

An example of a not normal TCP RST would be an application that abruptly disconnected due to an internal error.

A poorly written application can also cause TCP RST when it does not perform graceful shutdowns on the TCP socket before closing the connection.
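To make the difference between a graceful and an abrupt close concrete, here is a minimal sketch on a loopback connection. It is only an illustration, not a reproduction of what happens behind an NLB. The function name `close_and_observe` is made up for this example, and the `SO_LINGER` packing (`"ii"`) assumes a POSIX system such as Linux or macOS.

```python
import socket
import struct


def close_and_observe(abortive: bool) -> str:
    """Close the client end of a loopback TCP connection and
    report what the server end sees on its next recv()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)
    try:
        if abortive:
            # l_onoff=1, l_linger=0: close() drops the connection and
            # sends RST instead of the normal FIN handshake.
            # Assumption: POSIX struct linger layout (two ints).
            client.setsockopt(
                socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
            )
        else:
            # Graceful path: announce end of data with FIN before closing.
            client.shutdown(socket.SHUT_WR)
        client.close()
        try:
            data = server_side.recv(1024)
        except ConnectionResetError:
            return "reset"  # the peer sent RST
        return "clean close" if data == b"" else "data"
    finally:
        server_side.close()
        listener.close()


if __name__ == "__main__":
    print("abortive close ->", close_and_observe(abortive=True))
    print("graceful close ->", close_and_observe(abortive=False))
```

The abortive case is the kind of close that a reset counter would pick up; the graceful case ends with FIN and should not.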

I will guess that the behavior you are seeing is not a problem. However, to really know, you will need to do a wire trace and protocol analysis on each connection to determine exactly what is happening.

John Hanley
  • Like I said, there is one client connection, and it is stable. The new connection metric (NewFlowCount) stays at 0. Can a RST occur without breaking the connection? `TCP_Client_Reset_Count` is "The total number of reset (RST) packets sent from a client to a target." Can this be spam traffic? But I would guess that spam connections that arrive at dead ports on the ELB would go towards `TCP_ELB_Reset_Count`, "The total number of reset (RST) packets generated by the load balancer." – Aleksandr Dubinsky Mar 28 '18 at 21:32
0

One reason the load balancer reset count might be higher is that the Network Load Balancer has an idle timeout of 350 seconds. If no traffic passes over your TCP connection before that timeout elapses, the load balancer will forcibly close the connection.
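If the idle timeout is the cause, one possible mitigation is to enable TCP keepalive on the client or target socket so that probes are sent well before 350 seconds of silence. The sketch below assumes a plain TCP listener and a Linux-style socket API; `enable_keepalive` and its default values are hypothetical choices for illustration, not something taken from the question.

```python
import socket

NLB_IDLE_TIMEOUT_S = 350  # idle timeout quoted in this answer


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 30, probes: int = 4) -> None:
    """Turn on TCP keepalive so an otherwise idle connection
    still carries a probe before the idle timeout is reached."""
    if idle_s >= NLB_IDLE_TIMEOUT_S:
        raise ValueError("first probe must be sent before the idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform specific (present on Linux),
    # so each one is set only if this Python build exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

Call `enable_keepalive` on the socket before or right after connecting. Whether this changes the reset counts in your setup is something to verify against the metrics.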