I have a simple server, written in C, that accepts sensor and status information from a variety of sources, then consolidates and reformats it into a stream of ASCII text lines for clients. The clients connect via a listener socket, then read and do whatever with the stream of messages until the user closes the application. Since this is a one-way protocol, the server never bothers to check for pending received data.
Whenever there's a message to be sent to all active users, it goes through a simple loop:
    bufflen = strlen(tcp_buff);
    for (next_client_ix = 0; next_client_ix < MAX_TCP_CONNECTIONS; next_client_ix++)
        if (TCP_client_sd[next_client_ix] != 0)
        {
            rc = send(TCP_client_sd[next_client_ix], tcp_buff, bufflen, MSG_NOSIGNAL);
            if (rc != bufflen)
            {
                // Note: errno is only meaningful when send() returned -1, not on a short write
                errno_hold = errno;
                s = inet_ntoa(Tcp_client_sin[next_client_ix].sin_addr);
                // sin_port is in network byte order, so convert with ntohs()
                remote_port = ntohs(Tcp_client_sin[next_client_ix].sin_port);
                sprintf(log_buff, "Error %d (%s) sending alert to %s:%d. Closing\n", errno_hold, strerror(errno_hold), s, remote_port);
                log_message(SB_ALERT_TEXT_ERROR, log_buff);
                close(TCP_client_sd[next_client_ix]);
                TCP_client_sd[next_client_ix] = 0; // Free the socket for the next client
            }
        }
This worked fine for years, on versions of Ubuntu from 10.04 through 16.04, when we typically had only 1 or 2 (occasionally 3) clients active at a time, all connected via the Ethernet LAN. Lately, we've been running more clients (still single-digit count) at a time, and most of the increase is copies of a Windows client, usually connected to the LAN via SOHO WiFi routers. This also showed up last month when we had a client connecting remotely from a conference hall with public WiFi.
Once every week or two, the server stops sending to all clients for a few minutes. When I investigate with netstat, I find one or (often) more sockets stuck in CLOSE_WAIT, with a Recv-Q of 1, and about 13K in the Send-Q. Eventually, the server spits out an error message saying it's closing a client connection due to an errno of 32 (Broken Pipe), and everything goes back to normal.
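As far as I understand it, CLOSE_WAIT means the client's FIN has already arrived, and a Recv-Q of 1 means there is one unread byte sitting in front of it. A minimal sketch of how a write-only server could notice that state is below. The helper name peer_has_closed is hypothetical and not part of my server; it assumes it is safe to throw away anything the client sends, which holds only because the protocol is one-way.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of the original server.
 * Returns 1 if the peer has closed or reset the connection,
 * 0 if the connection still looks open.
 * Any bytes the client sent are read and discarded.
 */
static int peer_has_closed(int sd)
{
    char scratch[256];

    for (;;) {
        /* MSG_DONTWAIT makes this single call non-blocking */
        ssize_t n = recv(sd, scratch, sizeof scratch, MSG_DONTWAIT);

        if (n == 0)
            return 1;   /* end of stream: the peer's FIN was received */
        if (n > 0)
            continue;   /* unexpected input: drop it and keep looking */
        if (errno == EINTR)
            continue;   /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;   /* nothing pending and no FIN: still open */
        return 1;       /* ECONNRESET and similar: treat as dead */
    }
}
```

The idea would be to call this for each client just before send() and free the slot when it returns 1. It would not catch a client that simply drops off the WiFi without sending a FIN, so it can only be part of an answer.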
I'm guessing that there's some quirk in the Windows-via-WiFi connection that's causing the connection close sequence to happen differently, but that's a not-very-educated guess.
My question (finally!) is what I should do to either detect the approaching problem before it turns into a server hang, or get Linux to give me an immediate error instead of making me wait while it decides to give up. I've found a variety of ideas for servers that expect incoming data from their clients, but nothing for "write-only" connections. (One answer was to run netstat before every write and analyze its output, but that's not practical for a system we hope will feed data from hundreds of sensor arrays to dozens of clients when it goes into full-scale production.) I've added some code that tries to detect the problem with the Linux-only SIOCOUTQ ioctl, looking for data piling up in the transmit queue, but I haven't been able to test it properly because the hang happens so rarely in the wild. My attempt to build a misbehaving client didn't go well either: client-side Linux cheerfully stacks up enough data in its receive queue to keep it from failing for several days, so the server never sees a build-up on its side.
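For reference, the SIOCOUTQ check is along these lines. This is a simplified sketch rather than the exact code on the server: the function names and the idea of a fixed byte limit are placeholders for illustration. SIOCOUTQ is issued through ioctl() and reports how many bytes are still sitting unsent or unacknowledged in the socket's send queue.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>   /* SIOCOUTQ (Linux-only) */

/*
 * Hypothetical helper: returns the number of bytes still queued for
 * sending on sd, or -1 if the ioctl fails (errno is set by ioctl).
 */
static int send_queue_backlog(int sd)
{
    int pending = 0;

    if (ioctl(sd, SIOCOUTQ, &pending) < 0)
        return -1;
    return pending;
}

/*
 * Hypothetical policy: a client counts as stalled when its backlog is
 * above limit bytes, or when the backlog cannot be read at all.
 */
static int client_is_stalled(int sd, int limit)
{
    int pending = send_queue_backlog(sd);

    return (pending < 0 || pending > limit) ? 1 : 0;
}
```

The limit is a guess on my part. Given the roughly 13K I see in Send-Q when things go wrong, something below that might be a starting point, but I have no measurements to back up any particular value.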
Is there some socket or API call option I've missed that will say "Forget patience and retries: give up and fail NOW!"? Should I be patient, and wait a few weeks to see whether my SIOCOUTQ fix has solved the problem? Or do I need to refine my Google keyword selection skills to find an answer that's eluded me so far?
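The closest thing to "fail NOW" I've come across is the MSG_DONTWAIT flag to send(), which makes that one call return immediately with EAGAIN instead of blocking when the socket's send buffer is full. A minimal sketch of how that might look is below. The helper name try_send_now is hypothetical, and treating a short write as a failure is my own assumption: it seems reasonable for a line-oriented stream, since a half-sent line leaves the client out of sync anyway.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: try to send the whole buffer without blocking.
 * Returns 0 if every byte was accepted by the kernel, -1 otherwise.
 * On -1, errno is EAGAIN/EWOULDBLOCK when the send buffer is full
 * (slow or stuck client), or EPIPE/ECONNRESET etc. for a dead one.
 */
static int try_send_now(int sd, const char *buf, size_t len)
{
    /* MSG_DONTWAIT: never block; MSG_NOSIGNAL: no SIGPIPE on a dead peer */
    ssize_t rc = send(sd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);

    if (rc == (ssize_t)len)
        return 0;
    if (rc >= 0)
        errno = EAGAIN;   /* short write: send() itself did not set errno */
    return -1;
}
```

With something like this in place of the plain send(), one stuck client would cost a single failed call rather than stalling the loop for every other client. Whether to drop a client on the first EAGAIN or allow a few misses is a policy decision I haven't settled on. I've also seen the Linux TCP_USER_TIMEOUT socket option mentioned for bounding how long unacknowledged data may sit before the connection is aborted, but I haven't tried it.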
Thanks,
Ran