
I have, like many, been delving into the subject of testing whether TCP sessions are active/alive. It seems like an unnecessarily difficult problem with too many half-effective solutions. A connection doesn't know anything about its own state until it tests itself. Even then, attempts to send may succeed despite the connection actually being lost. Polling seems to deliver false positives for connection. Some servers are configured not to respond to pings. The only real test seems to be trying to make a fresh connection and seeing whether the attempt succeeds. This seems unnecessarily heavy-handed, and it seems a bit crazy that the protocol doesn't have a lightweight way of answering the question: 'at this particular instant, is it possible to transfer data from client to server and verify that it was received?'

I am working with the .NET Framework and the TCP objects it exposes. I would have expected that disconnecting the network cable creates an immediate signal to all consumers that the connection was lost. This isn't the case, however: nothing I can inspect on the connection is aware of the loss. Only trying to re-establish the connection discovers that the physical link has been broken.

What am I missing?

J Collins
  • How lightweight does it have to be? The most reliable way I've seen this done is with a good ol' request and response. You have to send data and once the server sends something back, you know you're good to go. Even with long-standing Websocket connections there's a ping/pong process that is constantly happening. – arjabbar Sep 15 '20 at 13:53
  • The question you can attempt to answer is "a very short time ago, was it possible to transfer data from client to server" - that's the nature of networks, independent machines, etc. You can never answer *in advance* whether your next action will be able to succeed. – Damien_The_Unbeliever Sep 15 '20 at 13:53
  • And ping tells you "this machine responds to echo requests", *not* "this machine is capable of processing my next request" – Damien_The_Unbeliever Sep 15 '20 at 13:54
  • All good points: communication occurs over time and there is no certainty apart from what has already happened. What surprises me is that the connection seems so hard to test, when I would assume testing it is a fundamental part of what it means to be a protocol. Somehow Chrome knows when I hit F5 that the wired connection has been lost, but a TcpClient object cannot. – J Collins Sep 15 '20 at 14:30
  • @arjabbar That implies the client and server have an agreed ping and response at the data level, which also implies the developer can control both ends. This all to establish a fact that surely the protocol itself is responsible for? – J Collins Sep 16 '20 at 08:04

1 Answer


TCP doesn't really work the way you seem to think it does, although there are some things we can do to make it work better for you. But first let's understand a little better how it works and why you see the behavior you do.

When you open a TCP connection, TCP uses a 3-way handshake to set up the connection. The client sends a SYN, the server responds with SYN+ACK, and then the client sends back an ACK. If neither side tries to send anything the connection will just sit there idle. You can unplug the cable from your machine. A tree can fall and take out your internet service. The internet provider can come repair your internet service, and you can plug the cable back into the ethernet port. And then the client can write to the socket and it should be delivered to the server. (Firewalls unfortunately deliberately break standards, and your firewall may have decided to time out the connection while you were waiting for your ISP to fix your service.) However, if you tried to make another connection while the cable was unplugged, TCP would try to send a SYN, and most likely discover that there is "no route to host." So it can't set up a new connection.

If you had tried to write to the socket while your internet service was out, TCP would try to send the data and wait for an ACK from the server. If it hasn't received an ACK after a retransmission timeout, it will try again, backing off exponentially on the timeout. After typically 15 tries (the Linux default, tcp_retries2) it will give up, which on Linux takes roughly 13 to 30 minutes depending on the retransmission timeout in effect.
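As a rough worked example of that backoff, the sketch below adds up the waits for a sender that doubles its retransmission timeout after every attempt. The 200 ms starting timeout and 120 s cap are assumptions modelled on Linux defaults, not values given in this answer, and `total_backoff_ms` is a hypothetical helper. Real stacks start from a measured round-trip time, so actual totals differ.

```python
def total_backoff_ms(retries=15, rto_min_ms=200, rto_max_ms=120_000):
    """Sum the waits before TCP gives up on unacknowledged data.

    Assumed model (not from the answer): the timeout starts at rto_min_ms,
    doubles after each attempt, and is capped at rto_max_ms. There is one
    wait after the original send plus one after each retransmission.
    """
    total = 0
    rto = rto_min_ms
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max_ms)
    return total


if __name__ == "__main__":
    ms = total_backoff_ms()
    print(f"{ms / 1000:.1f} s, about {ms / 60000:.1f} minutes")
```

Under these assumptions the total is a lower bound. A connection with a larger measured round-trip time starts from a longer timeout and takes longer to give up.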

As you can see, TCP is trying to be resilient in the face of failure, whereas you want to learn about failures very quickly. Systems that need to react quickly to connection failure (such as electronic stock exchanges which typically cancel open orders on connection failure) handle this as part of a higher level protocol by sending heartbeat messages periodically and taking action when a heartbeat is sufficiently overdue.
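The heartbeat idea can be sketched as a small watchdog that records when the peer was last heard from and reports when it has been silent for too long. This is only an illustration: `HeartbeatMonitor`, its parameters, and the "three missed intervals" threshold are hypothetical choices, not part of any particular protocol.

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer was last heard from (hypothetical helper)."""

    def __init__(self, interval_s, missed_allowed=3, clock=time.monotonic):
        # Treat the peer as lost after this many seconds of silence.
        self.deadline_s = interval_s * missed_allowed
        self._clock = clock
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat (or any message) arrives from the peer."""
        self._last_seen = self._clock()

    def is_overdue(self):
        """True once the peer has been silent for longer than the deadline."""
        return self._clock() - self._last_seen > self.deadline_s
```

The receive loop would call `record_heartbeat()` on every incoming message, and a timer would poll `is_overdue()` and close the socket or cancel pending work when it returns true. A monotonic clock is used so that wall-clock adjustments cannot trigger a false timeout.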

But if you can't control the protocol, there are some socket options you can use to improve the situation. SO_KEEPALIVE causes TCP to periodically send keepalive packets and it will eventually time out depending on the settings of TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT. TCP_USER_TIMEOUT allows you to set a timeout for how long data written to a socket can remain unacknowledged.

How exactly these two options work and interact is implementation dependent, and you have to consider what is going to happen when there is no unacknowledged data, when there is unacknowledged data, and when there is a slow consumer resulting in a zero window. In general it is advisable to use them together, with TCP_USER_TIMEOUT set to (TCP_KEEPIDLE + TCP_KEEPINTVL*TCP_KEEPCNT) * 1000 to get a consistent result. The factor of 1000 is there because on Linux the keepalive settings are in seconds while TCP_USER_TIMEOUT is in milliseconds.
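A minimal sketch of setting these options together, using Python's `socket` module because its constant names match the option names above. The function names and the default values (10 s idle, 5 s interval, 3 probes) are assumptions chosen for illustration. Not every platform defines every constant, so the sketch skips the ones that are missing.

```python
import socket


def user_timeout_ms(idle_s, interval_s, count):
    """TCP_USER_TIMEOUT value (milliseconds) matching the keepalive settings."""
    return (idle_s + interval_s * count) * 1000


def configure_keepalive(sock, idle_s=10, interval_s=5, count=3):
    """Enable keepalive and, where available, the per-socket tuning options.

    Returns the names of the options that were applied. The defaults are
    illustrative assumptions, not recommended values.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    per_socket = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", count),
        ("TCP_USER_TIMEOUT", user_timeout_ms(idle_s, interval_s, count)),
    )
    for name, value in per_socket:
        opt = getattr(socket, name, None)  # constant may be absent on this OS
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied.append(name)
    return applied
```

With these example values an idle dead connection would be reported after about 25 seconds, and unacknowledged written data would time out after the same 25000 ms, which is the consistency the formula is aiming for.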

Our friends at Cloudflare have a nice blog entry about how exactly these work together, but unfortunately only for Linux. I'm not aware of anything as comprehensive for Windows.

JimD.
  • Let's be honest I came into this project with no idea of how TCP works, despite having taken some time in the past to understand networking. Ideally I could just ping the server I am trying to connect to and if that fails, know the connection has gone bad. The server rejects pings so I looked at the TCP layer, since I have no control over the data protocol. I'm just surprised that for all the clever complexity of TCP, a simple 'how's it going', in the form maybe of a repeated handshake, isn't available. Thanks for the great answer though, Windows seems to hard code those settings. – J Collins Sep 17 '20 at 17:51
  • @JCollins I'm pretty sure you can set these on windows. For keep alive check out the socket options KeepAlive, TcpKeepAliveInterval, TcpKeepAliveRetryCount, and TcpKeepAliveTime in the docs for SocketOptionName Enum here: https://learn.microsoft.com/en-us/dotnet/api/system.net.sockets.socketoptionname?view=netcore-3.1. For user timeout, the following claims it works but the enum isn't in the headers: https://stackoverflow.com/a/12948084/7218127 – JimD. Sep 17 '20 at 19:11
  • The documents I read suggested that some of those parameters were global, set in the registry. Staying in the .NET Framework walled garden, I can't see how to set the retry count, which was the key one that was in the registry. The user timeout looked good but, as you say, wasn't in the enum, which meant maybe breaking out into unmanaged code. For now I'm working on a workaround. – J Collins Sep 17 '20 at 19:28
  • There are global settings for the defaults, but setting a socket option on a socket only affects that socket. Also, I'm not sure I believe the person who said user timeout worked even though it isn't defined in the headers: he #defines the option to be the same value used by Linux, and it would be a weird coincidence if that happened to be the same on Windows. – JimD. Sep 17 '20 at 23:20