
Probably the most asked-about subject in all of Stack Overflow: "Address already in use" errors. There are literally thousands of questions on the topic. This question asks about a specific detail.

The specific detail: how long is an address on which bind/accept has been called held after a process terminates abnormally?

I'm writing an HTTP server daemon. It accepts socket connections, obviously. If the process is terminated abnormally (either by the debugger or by an actual crash), the socket address on which accept has been called is held for some time before it can be reused (on Debian Linux). Restarting the server immediately and attempting to bind to the same endpoint results in an "Address already in use" error. The address appears to be reserved by the OS for a period on the order of 60 seconds after the process terminates. It may depend on whether there was an open client connection when the previous process was terminated.

If anyone can recommend an up-to-date accurate state diagram for TCP/IP connections, I could probably work this out myself. (Please do).

It seems to me that I need to write code in my server that retries the bind/accept operation for some period of time before giving up and exiting with an actual "Address already in use" error. This seems necessary in order to allow the server to be restarted. The code currently retries for 140 seconds, for no better reason than that 140 seconds is the timeout for a number of other TCP operations; but I'd like the peace of mind of an actual fact-informed value. The retry loop looks roughly like the sketch below.
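For reference, here is a minimal sketch of that retry loop (not the actual server code; the port, the 140-second window, and the helper name are illustrative):

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define LISTEN_PORT 8080            /* hypothetical port */
#define RETRY_WINDOW_SECONDS 140    /* current, arbitrary, retry window */
#define RETRY_INTERVAL_SECONDS 5

static int bind_with_retry(int listen_fd)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(LISTEN_PORT);

    time_t deadline = time(NULL) + RETRY_WINDOW_SECONDS;
    for (;;) {
        if (bind(listen_fd, (struct sockaddr *)&addr, sizeof addr) == 0)
            return 0;                   /* bound successfully */
        if (errno != EADDRINUSE)
            return -1;                  /* some other, unrecoverable error */
        if (time(NULL) >= deadline) {
            errno = EADDRINUSE;
            return -1;                  /* give up: address still in use */
        }
        sleep(RETRY_INTERVAL_SECONDS);  /* wait and try again */
    }
}
```

listen() and accept() follow in the usual way once the bind succeeds.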

So that's the question. How long do I need to retry for before giving up on an attempt to bind a port used to accept connections? What is the timeout on the reservation of an address that has been previously used to accept socket connections by a process that has been abnormally terminated? Is this possibly an Operating-System-dependent question? The ideal answer would reference a standards-imposed timeout in the TCP/IP socket state diagram.

Robin Davies
  • You know about `SO_REUSEADDR`, right? – Steve Summit Jul 15 '21 at 14:30
  • I do. Should I really be using this on an HTTP server connection though? If the socket is legitimately in use, (or if my process is already running), I do want to fail. And.. "if all of the sockets on the same port provide TCP service, ...the port cannot be guaranteed to be handled by the correct socket" – Robin Davies Jul 15 '21 at 14:32
  • I'm not sure. I use it on all my servers, but perhaps that's a bad idea. (The use case I tend to be interested in is that I've just found a simple bug and killed the server and edited the server code and recompiled, and I don't want to have to run out the rest of the 2 minute clock before testing again. I've wondered whether it's appropriate to leave `SO_REUSEADDR` turned on in production, or not.) – Steve Summit Jul 15 '21 at 14:40
  • Ok. As an #ifdef debug feature, it seems like a good idea. I shall make it so. The debug turn time is killing me. fwiw, it is a definite security risk in production. It allows a process other than yours to randomly intercept half of the inbound connections on your port by also accepting with SO_REUSEADDR (according to Microsoft docs). I would accept an answer to that effect. – Robin Davies Jul 15 '21 at 14:51
  • Don't think I'm entirely following your scenario, although I've run into similar issues and had the same types of questions in the past. I remember one thing I wanted to know was, "how can I tell if this connection is still alive?", and I don't think there's any easy answers. The other side can disappear at any time, and you don't know what you don't know. Perhaps this will be helpful: https://unix.stackexchange.com/questions/386536/when-how-does-linux-decides-to-close-a-socket-on-application-kill – yano Jul 15 '21 at 14:54
  • Also this: https://stackoverflow.com/questions/31831089/socket-is-open-after-process-that-opened-it-finished – yano Jul 15 '21 at 14:57
  • @yano: Your link answers the question definitively for a client connection. Not so definitively for a server's bind/accept case. The concrete scenario (Steve Summit puts his finger on the problem): you have to wait two minutes to debug a process that has been abnormally terminated, because you have to wait for the address to be released. Although I still need the actual timeout for production server use. – Robin Davies Jul 15 '21 at 14:58
  • @RobinDavies My assumption is that in production, the server is up "all the time", so it's not a problem. Unless the OS it's running on reboots, in which case (a) that's probably going to take longer than 2 minutes and (b) the OS's memory that the address was in use is going to be erased, too. Are you trying to meet some High Availability requirements, and worried about your server dying due to circumstances beyond its control, and needing to be restarted ASAP? – Steve Summit Jul 15 '21 at 15:11
  • Normally reopening a listening socket does not cause a "address in use". What do you see after you stop your process and run `netstat -ant4 | grep ` ? – rustyx Jul 15 '21 at 15:13
  • Steve Summit: the concern is that if you set SO_REUSE_ADDR on your server's socket, then another process can ALSO open a socket on the same address with the SO_REUSE_ADDR option. At that point, which process gets an incoming connection is UNSPECIFIED according to Microsoft doc. Either process can get it. (Windows provides a non-standard socket option to prevent this). I suspect both ports respond, and the client accepts the first response. Haven't found the Linux docs on what happens in this case yet. – Robin Davies Jul 15 '21 at 15:18
  • @rustyx: A socket from a client that was connected when the process terminated, now in TIME_WAIT state. The LISTEN socket is gone. Presumably this conflicts with the bind of (*:8080, *) required to accept. tcp 0 0 192.168.0.26:8080 192.168.0.24:53990 TIME_WAIT – Robin Davies Jul 15 '21 at 15:26
  • Pretty sure I can answer the question given the comments so far. If anyone wants to do so first, I'd be happy to accept your answer. – Robin Davies Jul 15 '21 at 15:28
  • TIME_WAIT is the state you end up in when you abruptly `close` a connection. If you `shutdown` client sockets before closing (i.e. graceful close), then there shouldn't be a TIME_WAIT state. Obviously not possible when killing the server, but possible in a shutdown handler. – rustyx Jul 15 '21 at 16:08
  • @rustyx: Interestingly, the really deadly state is FIN_ACK, with a 120s timeout. So the sockets do have to be shutdown-ed, even if they've been closed. Pretty sure I have never implemented that right. :-( Certainly a rare case; but not a non-existently rare case. – Robin Davies Jul 15 '21 at 16:59

1 Answer


Thanks to the multiple commenters who provided the hints that made this answer possible.

There are actually two issues at hand. The first is that debug turns take at least 60 seconds, because server ports can't be reused for at least 60 seconds after a debug session terminates. The second is what production servers should do to reliably bind their listening sockets on startup.

The problem is not with an unreleased listening socket; it's with ghost client sockets that are lingering in the TIME_WAIT or FIN_WAIT states. To listen on a port, you have to bind the listening socket to the address pair (0.0.0.0:server_port, *:*). That binding conflicts with lingering client sockets, whose address pairs have the form (server_address:server_port, client_address:some_port). The listening address cannot be bound until those client sockets time out. The implementation requirement is that a socket cannot be bound to an address pair that overlaps the address pair of an existing socket, and the listening binding does overlap the lingering client bindings.

The TCP protocol requires those client sockets to time out in order to prevent old connection traffic still in flight from being delivered to a new listener on the same port. Perhaps this could be done more efficiently, but that is why lingering client connections prevent the listening port from being bound.

The longest possible time for which ghost sockets can linger is not the TIME_WAIT timeout alone: a socket that is still waiting on the peer sits in the FIN_WAIT states before it ever reaches TIME_WAIT, so the worst case comes to roughly twice the TIME_WAIT interval. So servers need to retry the bind for at least two TIME_WAIT timeout intervals.

How long is the TIME_WAIT timeout, you ask? It's operating-system dependent, and there does not appear to be a standard way to discover it at runtime. On Linux and BSD systems it's 60 seconds. On Windows systems, and on at least one other operating system, it's 120 seconds, although it's common to configure a shorter value on production Windows servers.
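For what it's worth, on Linux specifically the FIN_WAIT_2 timeout for orphaned sockets can at least be inspected through procfs (the TIME_WAIT length itself is a compile-time constant there, with no sysctl). This is a Linux-only sketch, not a portable way to discover the timeout:

```c
#include <stdio.h>

int main(void)
{
    /* Linux-specific: the FIN_WAIT_2 timeout, in seconds, for sockets
     * whose owning process has gone away. */
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_fin_timeout", "r");
    int seconds = -1;
    if (f != NULL) {
        if (fscanf(f, "%d", &seconds) == 1)
            printf("tcp_fin_timeout: %d seconds\n", seconds);
        fclose(f);
    }
    return 0;
}
```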

So the unpleasant answer is that HTTP servers should retry bind/accept operations on startup for at least 120 seconds on Linux/BSD systems, and for at least 240 seconds to be reasonably portable.

@Steve Summit helpfully pointed out that you can avoid the usual 60-second delay by setting the SO_REUSEADDR socket option on your server's listening socket. This solves the debugging problem nicely. HOWEVER, doing so on production servers is a SIGNIFICANT SECURITY ISSUE. A non-privileged process can hijack your existing listener by also listening on the same address with SO_REUSEADDR enabled (at which point it's a coin toss which process will accept a given connection). So production servers in release builds should NEVER do this. This is definitely true on Windows systems, and almost definitely true on Linux systems as well. A debug-only sketch follows.
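As discussed in the comments, one reasonable compromise is to enable SO_REUSEADDR only in debug builds. A minimal sketch of that idea (the NDEBUG gate and the helper name are just one way to arrange it):

```c
#include <sys/socket.h>

static int make_listener_reusable(int listen_fd)
{
#ifndef NDEBUG
    /* Debug builds: allow immediate rebinding after an abnormal exit. */
    int on = 1;
    return setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
#else
    /* Release builds: keep the default exclusive bind behaviour. */
    (void)listen_fd;
    return 0;
#endif
}
```

Call this on the listening socket before bind; in release builds it is a no-op, so the server still fails fast if another process already owns the port.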

Robin Davies