
I have a simple server, written in C, that accepts sensor and status information from a variety of sources, then consolidates and reformats it into a stream of ASCII text lines for clients. The clients connect via a listener socket, then read and do whatever with the stream of messages until the user closes the application. Since this is a one-way protocol, the server never bothers to check for pending received data.

Whenever there's a message to be sent to all active users, it goes through a simple loop:

bufflen = strlen(tcp_buff);
for (next_client_ix = 0; next_client_ix < MAX_TCP_CONNECTIONS; next_client_ix++)
    if (TCP_client_sd[next_client_ix] != 0)
        {
        rc = send(TCP_client_sd[next_client_ix], tcp_buff, bufflen, MSG_NOSIGNAL);
        if (rc != bufflen)
            {
            errno_hold = errno;
            s = inet_ntoa(Tcp_client_sin[next_client_ix].sin_addr);
            remote_port = ntohs(Tcp_client_sin[next_client_ix].sin_port);    // convert from network byte order
            sprintf(log_buff, "Error %d (%s) sending alert to %s:%d. Closing\n", errno_hold, strerror(errno_hold), s, remote_port);
            log_message(SB_ALERT_TEXT_ERROR, log_buff);
            close(TCP_client_sd[next_client_ix]);
            TCP_client_sd[next_client_ix] = 0;        // Free the socket for the next client
            }
        }

This worked fine for years, on versions of Ubuntu from 10.04 through 16.04, when we typically had only 1 or 2 (occasionally 3) clients active at a time, all connected via the Ethernet LAN. Lately, we've been running more clients (still single-digit count) at a time, and most of the increase is copies of a Windows client, usually connected to the LAN via SOHO WiFi routers. This also showed up last month when we had a client connecting remotely from a conference hall with public WiFi.

Once every week or two, the server stops sending to all clients for a few minutes. When I investigate with netstat, I find one or (often) more sockets stuck in CLOSE_WAIT, with a Recv-Q of 1, and about 13K in the Send-Q. Eventually, the server spits out an error message saying it's closing a client connection due to an errno of 32 (Broken Pipe), and everything goes back to normal.

I'm guessing that there's some quirk in the Windows-via-WiFi connection that's causing the connection close sequence to happen differently, but that's a not-very-educated guess.

My question (finally!) is what I should do to either detect the approaching problem before it turns into a server hang, or get Linux to give me an immediate error instead of making me wait while it decides to give up. I've found a variety of ideas for servers expecting incoming data from their clients, but nothing for "write-only" connections (well, one answer was to run netstat before every write and analyze its output, but that's not really practical for a system we hope to have feeding data from hundreds of sensor arrays to dozens of clients when it goes into full-scale production). I've added some code that uses the Linux-only SIOCOUTQ ioctl to look for data piling up in the transmit queue, but haven't been able to get a good test because the problem happens so rarely in the wild. And my attempt to write a misbehaving client didn't go well, because client-side Linux cheerfully stacks up enough data in its receive queue to keep the test from failing for several days, so the server never sees a build-up on its side.
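
The check I added is roughly along these lines (a sketch, not my exact code; OUTQ_LIMIT is a placeholder threshold):

#include <sys/ioctl.h>
#include <linux/sockios.h>      /* SIOCOUTQ (Linux-only) */

int outq_bytes = 0;
/* SIOCOUTQ reports the number of bytes still queued in the socket's send buffer */
if (ioctl(TCP_client_sd[next_client_ix], SIOCOUTQ, &outq_bytes) == 0
    && outq_bytes > OUTQ_LIMIT)         // OUTQ_LIMIT: arbitrary threshold, e.g. 8192
    {
    sprintf(log_buff, "Client %d send queue at %d bytes. Closing\n", next_client_ix, outq_bytes);
    log_message(SB_ALERT_TEXT_ERROR, log_buff);
    close(TCP_client_sd[next_client_ix]);
    TCP_client_sd[next_client_ix] = 0;  // Free the socket for the next client
    }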

Is there some socket or API call option I've missed that will say "Forget patience and retries: give up and fail NOW!"? Should I be patient, and wait a few weeks to see whether my SIOCOUTQ fix has solved the problem? Or do I need to refine my google keyword selection skills to find an answer that's eluded me so far?

Thanks,

Ran

  • I had one hang with "heaps" of CLOSE_WAIT connections, but the recent incidents had only one. Which compounds the confusion: if writing to a single "not ready for more data" socket can cause a hang, how did I manage to get 8 sockets in CLOSE_WAIT? – Ran Talbott Jul 09 '18 at 14:05

1 Answer


I'm assuming you aren't using non-blocking sockets or a send timeout (SO_SNDTIMEO).
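
For reference, a per-socket send timeout looks roughly like this (the 5-second value is arbitrary, and client_fd stands in for one of your descriptors):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>

struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };   /* give up after 5 seconds */
if (setsockopt(client_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0) {
    perror("setsockopt(SO_SNDTIMEO)");
}
/* A send() that can't complete in time then fails (possibly after a partial
   write) with errno set to EAGAIN/EWOULDBLOCK instead of blocking indefinitely. */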

That send call can block for a long time because of a misbehaving client. Imagine if I wrote a client that connected to your server but never called recv on my client socket. Literally this:

int result = connect(sock, addr, addrlen);
while (1) {
    sleep(1);
}

After a sufficient number of send calls to my client, the TCP pipe would get backed up, and your send call could literally block forever. Hence, no send calls to other clients could take place until the blocked one completed or errored out. Such is the nature of single-threaded servers and blocking sockets.

A more likely case is if a client connects to your server, then suddenly loses network connectivity. That could also hang your server for several seconds.

Consider any or all of the following for updating your server:

  • non-blocking sockets - and handling the case where send returns a value indicating only part of the data was sent (see the sketch after this list). You could also poll the socket with recv just to see whether the remote client has exited or initiated a one-way shutdown.

  • Each client gets its own thread and message queue. When the server has something to send, it puts a copy of the data bytes into each client's message queue. Each thread is responsible for sending. A badly behaving client associated with one thread won't stop the other threads from sending.

  • SO_LINGER. You could try setting the linger time to zero on each socket to see if that helps.
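
A rough sketch of the first (non-blocking) option, with client_fd standing in for one entry of your descriptor array and the re-queueing of unsent bytes left out:

#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Once, right after accept(): switch the client socket to non-blocking mode */
int flags = fcntl(client_fd, F_GETFL, 0);
fcntl(client_fd, F_SETFL, flags | O_NONBLOCK);

/* In the broadcast loop: */
ssize_t rc = send(client_fd, tcp_buff, bufflen, MSG_NOSIGNAL);
if (rc == (ssize_t)bufflen) {
    /* whole message sent */
} else if (rc >= 0 || errno == EAGAIN || errno == EWOULDBLOCK) {
    /* Partial send or full send buffer: the client isn't reading fast enough.
       Either queue the unsent tail for later, or treat the client as stuck. */
    close(client_fd);
    client_fd = 0;               /* free the slot, as in your existing loop */
} else {
    /* Hard error (EPIPE, ECONNRESET, ...): log, close, free the slot */
    close(client_fd);
    client_fd = 0;
}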

selbie
  • I tried writing a misbehaving client just like that to test my possible fix. What I didn't realize was that the Linux it ran on has a _huge_ (multiple days' worth) limit on its receive queues, so I aborted the test rather than risk a hard server failure from the new code when I wasn't around to recover. – Ran Talbott Jul 09 '18 at 14:26
  • I think you have the right idea with changing to non-blocking sockets: the volume of data is low enough that a partial send will probably only happen in my "problem" situation anyway, so I don't need complicated recovery code. I'll give that a try. Thanks. – Ran Talbott Jul 09 '18 at 14:34