Possible causes of a deadlock in socket select

Question

I have a Jabber server application and a Jabber client application, both written in C++.

When the client receives and sends a lot of messages (more than 20 per second), the select eventually just freezes and never returns.

With netstat, I can see that the socket is still connected on Linux, and with tcpdump, I can see that messages are still being sent to the client, but the select just never returns.

Here is the code that calls select:

bool ConnectionTCPBase::dataAvailable( int timeout )
  {
    if( m_socket < 0 )
      return true; // let recv() catch the closed fd

    fd_set fds;
    struct timeval tv;

    FD_ZERO( &fds );
    // the following causes a C4127 warning in VC++ Express 2008 and possibly other versions.
    // however, the reason for the warning can't be fixed in gloox.
    FD_SET( m_socket, &fds );

    tv.tv_sec = timeout / 1000000;
    tv.tv_usec = timeout % 1000000;

    return ( ( select( m_socket + 1, &fds, 0, 0, timeout == -1 ? 0 : &tv ) > 0 )
             && FD_ISSET( m_socket, &fds ) != 0 );
  }

And here is where the thread is stuck, according to gdb:

Thread 2 (Thread 0x7fe226ac2700 (LWP 10774)):
#0  0x00007fe224711ff3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000000004706a9 in gloox::ConnectionTCPBase::dataAvailable (this=0xcaeb60, timeout=<value optimized out>) at connectiontcpbase.cpp:103
#2  0x000000000046c4cb in gloox::ConnectionTCPClient::recv (this=0xcaeb60, timeout=10) at connectiontcpclient.cpp:131
#3  0x0000000000471476 in gloox::ConnectionTLS::recv (this=0xd1a950, timeout=648813712) at connectiontls.cpp:89
#4  0x00000000004324cc in glooxd::C2S::recv (this=0xc5d120, timeout=10) at c2s.cpp:124
#5  0x0000000000435ced in glooxd::C2S::run (this=0xc5d120) at c2s.cpp:75
#6  0x000000000042d789 in CNetwork::run (this=0xc56df0) at src/Network.cpp:343
#7  0x000000000043115f in threading::ThreadManager::threadWorker (data=0xc56e10) at src/ThreadManager.cpp:15
#8  0x00007fe2249bc9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#9  0x00007fe22471970d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()

Do you know what can cause a select to stop seeing incoming messages even though we are still sending to it? Is there any buffer limit in Linux when receiving and sending a lot of messages through a socket?

Thanks

Maybe the other side crashed or closed. I suggest to always have a non-infinite timeout to `select` (e.g. a one second timeout). I also suggest to use `poll` not `select`. And you could `strace` your program to find out how `select` is "erroneously" behaving. — Basile Starynkevitch, Jun 24 '12 at 16:22
The other side (the server) is still running, and still sending messages. With tcpdump, I can see those messages sent to the client port. The reason we don't use a timeout in the select is to avoid burning CPU whenever nothing is being sent. Also, when I run strace on the client, it is still waiting on the socket, and the socket is still listed in /proc//fd/ — ruddy, Jun 24 '12 at 16:28
"With tcpdump, I can see those messages sent to the client port". So are you running tcpdump on the server side? What happens if you sniff traffic on your local machine? Are those packets arriving? — Heisenbug, Jun 24 '12 at 16:34
The least thing you could do is check select()s return value and if it is -1 check errno. Could be EINTR / EAGAIN or EINVAL ... or anything. — wildplasser, Jun 24 '12 at 16:37
The server and the client are running on the same machine. So the client connects to the server to port 5222 and the client port is like 41919. So when I run tcpdump on interface 'lo', I can see that a packet has been delivered from 5222 to port 41919 — ruddy, Jun 24 '12 at 16:38
The select never returns -1. I tried adding a timeout (e.g. 1 second), and when the bug occurs, the select just reaches its timeout (as if nothing had happened); we then try again and never get the data. — ruddy, Jun 25 '12 at 04:54
Just a thought: if you have only one socket, why not read from it in non-blocking mode with the PEEK option? It would be a much simpler way to see if data is available. — Offirmo, Jun 25 '12 at 14:43

Jirka Hanika · Answer 1 · 2012-06-25T07:15:57.883

There are several possibilities.

Exceeding FD_SETSIZE

Your code checks for a negative file descriptor, but not for one that reaches the upper limit, FD_SETSIZE (typically 1024). Whenever that happens, your code is:

- corrupting its own stack
- presenting an empty fd_set to select, which will cause a hang

Supposing that you do not need so many concurrently open file descriptors, the solution would probably consist of finding and removing a file descriptor leak; look especially at the code up the stack that handles closing of abandoned descriptors.

There is a suspicious comment in your code that indicates a possible leak:

// let recv() catch the closed fd

If this comment means that somebody sets m_socket to -1 and hopes that a recv will catch the closed socket and close it, who knows, maybe we are closing -1 and not the real closed socket. (Note the difference between closing on network level and closing on file descriptor level which requires a separate close call.)

This could also be treated by moving to poll but there are a few other limits imposed by the operating system that make this route quite challenging.

Out of band data

You say that the server is "sending" data. If that means that the data is sent using the send call (as opposed to a write call), use strace to determine the flags argument passed to send. If the MSG_OOB flag is used, the data is arriving as out-of-band data, and your select call will not notice it until you pass a copy of fds as the exception set (the fourth parameter).

fd_set fds_copy = fds;
select( m_socket + 1, &fds, 0, &fds_copy, timeout == -1 ? 0 : &tv )

Process starvation

If the box is heavily overloaded, the server is executing without any blocking calls, and with a real time priority (use top to check on that) - and the client is not - the client might be starved.

Suspended process

The client might theoretically be stopped with a SIGSTOP. You would probably know if this were the case: you would have pressed Ctrl-Z somewhere, or some other process would be exercising control over the client beyond you simply starting it yourself.

Thanks for your answer. The client application is not starved. And the process is not suspended. Also, I added the fds_copy to get the exceptions and I never get them and the problem persists — ruddy, Jun 25 '12 at 04:51
@ruddy - Thanks for your systematic research. I edited the answer to add another possibility worth looking at. — Jirka Hanika, Jun 25 '12 at 07:17
