How do I determine what's resetting my connection?

Question

I have a client and server, based on TcpListener and TcpClient. The client connects to the server and they exchange some data. Everything works just fine when I run locally.

But when I put the server in a Docker container on Azure Container Services, and connect the client to it, the following happens:

Client connects successfully to server
Client and server perform successful handshake
Data transfer begins
Approximately 20 seconds later (this is supposed to take several minutes) the whole thing blows up. The server reports "connection reset by peer" and the client reports "error reading past the end of the stream."

Each side seems to think the other side is the one with the problem. When I'm running locally, everything works as expected, which leads me to believe that the problem is somewhere in between.

There isn't a fundamental issue with establishing the connection, such as a firewall getting in the way, because I've verified at both ends that they're connecting and performing the handshake. The client is not "slamming the phone down"; it's expecting more data from the server. But "connection reset by peer" means that someone somewhere is intentionally sending a RST packet.
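To make the two error messages concrete, here is a minimal loopback sketch in Python (not the asker's .NET code). It shows how each kind of close looks to the reading side: a graceful close delivers a FIN, which the reader sees as end-of-stream, while an abortive close delivers an RST, which the reader sees as "connection reset by peer". The helper name `observe_close` is hypothetical, and the `SO_LINGER` packing assumes a POSIX-style `struct linger` (Linux or macOS).

```python
import socket
import struct


def observe_close(abortive: bool) -> str:
    """Return what the accepting side sees when its peer closes the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    conn.settimeout(5)  # avoid hanging forever if nothing arrives

    if abortive:
        # l_onoff=1, l_linger=0 makes close() send RST instead of FIN
        # (assumes the POSIX layout of struct linger: two C ints).
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()

    try:
        data = conn.recv(1024)
        # An empty read means the peer closed gracefully (FIN): end of stream.
        return "eof" if data == b"" else "data"
    except ConnectionResetError:
        # The peer (or something speaking for it) sent RST.
        return "reset"
    finally:
        conn.close()
        listener.close()
```

If the symptoms map the same way here, the server saw an RST while the client saw an orderly end of stream. That would be consistent with something in between terminating each half differently, though only a packet capture at both ends could confirm it.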

Is there any good way to figure out what's interfering with my data transfer?

The key is **connection reset by peer**. That means one side decided to drop the connection. Since your client is also reporting **error reading past the end of the stream** that points me to the Internet router (cable modem) or Antivirus as the first culprits. What type of network/Internet/router do you have set up on the client side? When I see similar issues from a home Internet, I first reboot my Internet router and reboot my computer. Then I double-check if Antivirus is the source of the problem which I have seen many times. — John Hanley, Nov 27 '21 at 01:11
This seems to be for dba.stackexchange.com or serverfault even — Leandro Bardelli, Nov 27 '21 at 01:34
@JohnHanley Would a router or antivirus allow me to connect and only interrupt things after several seconds have gone by? — Mason Wheeler, Nov 27 '21 at 13:39
Yes. Home routers often have bugs, run out of memory, get reset by the ISP or errors on the fibre/cable, etc. Antivirus can monitor traffic for behavior and then decide to block the connection. Anything in the route from you to the host can interrupt your connection. — John Hanley, Nov 27 '21 at 17:34
You can check what's going on at the lowest level with Wireshark. Maybe there is something going on in the network. — Sasha, Nov 30 '21 at 05:52
Why don't you create a connection, then send something from client to server, see if it's readable, then the other way around. See where the error lies. Write the result into a log file. — Charles, Dec 02 '21 at 04:55
@Charles I'm doing that. It works just fine. That's what's so frustrating about this! There is no visible cause to it! — Mason Wheeler, Dec 02 '21 at 13:00
If your server and client think there is something in between. What is there in between? nginx controller? load balancer? You could always spin up a mock environment, where there is nothing in between, and no risk of your production environment being impacted. Container has direct access from public internet. Create an isolated environment and remove all things in between. You need to start deducing things to root out the issue. Cos we need more to troubleshoot with from remote. — Marco, Dec 02 '21 at 15:00
@MasonWheeler have you tried the http header Connection: keep-alive? On client side you should handle Timeout too on the TCPClient. — Marco Di Scala, Dec 06 '21 at 14:09
@MarcoDiScala This is not an HTTP connection. As for timeout, it looks like if it times out waiting for data, it will throw a different exception on the client side from what I'm seeing. — Mason Wheeler, Dec 06 '21 at 14:16
@MasonWheeler I refer to the [TCP Keep Alive](https://learn.microsoft.com/en-us/dotnet/api/system.net.sockets.socketoptionname?view=net-6.0#System_Net_Sockets_SocketOptionName_KeepAlive) option. — Marco Di Scala, Dec 06 '21 at 15:07
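For reference, the TCP keep-alive option mentioned in the comments above can be sketched as follows. This is a Python illustration rather than the .NET `SocketOptionName.KeepAlive` call itself; the helper name `enable_keepalive` and the timing values are assumptions, and the per-socket tuning options are platform specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, probes=3):
    """Turn on TCP keep-alive; return True if the option is set afterwards."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Tuning knobs (names as exposed on Linux). Skip any that this
    # platform's socket module does not define.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

Keep-alive probes only matter while a connection is idle, so this would help only if an idle timeout were really the cause. A stream that is transferring data continuously should not trip one.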

Danut Radoaica · Answer 1 · 2021-12-02T13:50:38.040

0

For Azure Container Services (be it Azure Container Instances or Azure Kubernetes Service), the major cause for intermittent connection issues is hitting a limit while making new outbound connections. The limits you can hit include:

TCP Connections
SNAT ports

Please see:

Detecting SNAT port exhaustion on Azure Kubernetes Service
Troubleshooting intermittent outbound connection errors in Azure App Service (although it is written for Azure App Service, most of it still applies)

More info:

edited Dec 02 '21 at 13:50

answered Dec 01 '21 at 20:02

Danut Radoaica


Good information links, but these would not apply to connections already connected. – John Hanley Dec 01 '21 at 20:04
I had a similar situation with communication problems between an Apache Ignite server in Azure Kubernetes and thin clients in Azure Web Apps. The reset by peer appeared on the server and the client side at the same time. The only solution was retry logic with exponential backoff on the client side. I concluded it was a problem in the Azure Load Balancer; maybe I was too aggressive with the calls. – Danut Radoaica Dec 01 '21 at 20:18
This is not making a new outbound connection; it's a server receiving an inbound connection, only to have it terminated several seconds later. – Mason Wheeler Dec 01 '21 at 22:42
For inbound the problem may come from the Azure Container Instances load balancer idle timeout. If a period of inactivity is longer than the timeout value, there's no guarantee that the TCP or HTTP session is maintained between the client and your cloud service. When the connection is closed, your client application may receive the following error message: "The underlying connection was closed: A connection that was expected to be kept alive was closed by the server." A common practice is to use a TCP keep-alive. This practice keeps the connection active for a longer period. – Danut Radoaica Dec 01 '21 at 22:51
If the server is reaching the TCP limits, even established inbound connections may be affected. In my example the Apache Ignite server received and established an inbound connection, and after some seconds it threw "Connection reset by peer" while the thin clients threw "Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." – Danut Radoaica Dec 01 '21 at 23:00
@DanutRadoaica That's unlikely, unless it's using an odd definition of "inactivity." Basically, the client connects to the server and immediately starts downloading a stream of data that's meant to continue until it finishes, then it should disconnect gracefully. – Mason Wheeler Dec 01 '21 at 23:15
@MasonWheeler The message "Connection reset by peer" indicates that the remote server sent an RST to forcefully close the connection, either deliberately as a mechanism to limit connections, or as a result of a lack of resources. Either way you are likely opening too many connections, or reconnecting too fast. – Danut Radoaica Dec 01 '21 at 23:28
But in your case the "Connection reset by peer" is on the server side. That means that the client has sent the RST to the server because the client received an invalid packet (see the link under "More info" in my answer). – Danut Radoaica Dec 02 '21 at 00:01
@DanutRadoaica That "more info" link appears to be full of a bunch of Kubernetes-specific stuff that's all Greek to me. I'm not using Kubernetes. It's one container in one Container Group. The client is not sending a reset. Everything works when I run the exact same software locally. Somewhere between me and ACS, there's a man in the middle breaking my connection, and I want to figure out who it is and why. – Mason Wheeler Dec 02 '21 at 13:02
It is Kubernetes-specific, but the scenario is also valid for Azure Container Instances. Something bad (bad network, bad packets, etc.) travels from the server to the client on the established inbound connection, and the client sends the RST to the server. There is no other way (from my point of view) to receive "Connection reset by peer" on the server side. I added more links with the same solution: prevent invalid packets from reaching the client. – Danut Radoaica Dec 02 '21 at 13:50
The client is not sending the reset. *The client is not sending the reset.* **The client is not sending the reset!** I don't know how else to say this; I've been quite clear that something other than the client is doing this. The client is erroring out saying it's expecting more data. The problem is in between the client and the server, and continuing to tell me to look at things that can cause the client to send a reset is actively counterproductive. – Mason Wheeler Dec 02 '21 at 14:34
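The "retry logic with exponential backoff" workaround mentioned earlier in these comments could look roughly like the Python sketch below. The function name `retry_with_backoff`, its parameters, and the choice of exceptions to retry on are all assumptions for illustration, not code from the commenter. It works around a dropped connection rather than explaining what dropped it.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation(); on a connection failure wait and retry, doubling the delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, EOFError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

Here `operation` would be a callable that reconnects and resumes the transfer. Passing `sleep` as a parameter is only a convenience so the delays can be observed in tests without actually waiting.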
