0

I am having problem with select() call on an multi-socket app.

Here is how it is supposed to work.

Writer writes: [0010]HelloWorld on a socket, where the the first 4 character are always digits representing the payload size.

Reader should do the following:

  1. call select() to verify if a given socket is readable, then read first 4 character, change char to digit to get size of buffer in number, and allocate a buffer of that size.
  2. copy characters (after the first 4 characters) from the socket and paste to the buffer for further processing
  3. read 4 characters again, which should fail and upon failure to read any data, the app should clean exit the program.

Problem is in the 3rd select call. select followed by read iteration, every time we check select() for readability of socket and once that is verified, we proceed with the read. While the socket is valid and almost whole process works just fine, except for last point at step 3 before read is expected to fail, I call select() system call for last time, and it completely freezes the thread upon calling select .

I am not finding any sources online which can explain this weird phenomenon. To verify that the thread is not returning I have created a dummy object just before making the system call select() and logged it on destruction. Unformtunately, the distructor is never getting called.

Source code is propriotery, cannot be shared.

snippet:

fd_set f_set;
int err = 0;
while(check_edge_case())
{
  if(time_out_valid())
  {
     int nfds = GetFileDescriptor();

     FD_ZERO(&f_set);
     FD_SET(nfds, &f_set);

     for_each (clients, [&](client_t &client)
     {
       int fd = client.descriptor;
       if(fd)
       {
         FD_SET(fd, &f_set);
         nfds = std::max(nfds, fd);
       }
     });
    
    ++nfds;
    
    // perform select
    if(time_out_valid())
    {
      struct timeval_t time_val = GetTimeOut();
      err = select(nfds, &f_set, 0,0, &time_val);
    }
    else
    {
      err = select(nfds, &f_set, 0, 0, 0); // blocks in this statement
    }
    
    // check for error
    if(!err)
    {
      err = 0;
      continue;
    }
    else if (err = -1 && errno == EINTR)
    {
      err = 0;
      continue;
    }
    else if (err < 0)
    {
      retunr errno;
    }
    
    if(FD_ISSET(GetFileDescriptor(), &f_set))
    {
      return ECANCELED;
    }
    
    // further processing for read operation
    bool executed = false;
    for_each (clients, [&](client_t &client)
     {
       int fd = client.descriptor;
       if(FD_ISSET(fd, &f_set))
       {
         if(client.f_read) err = client.f_read();
         else err = 0;
         
         executed = true;
         break;
       }
     });
    
    if(!found) return ENODATA;
    
    
  }
}
kishoredbn
  • 2,007
  • 4
  • 28
  • 47
  • 1
    I've never had a similar experience with `select`. Can you show a [mre]? – Ted Lyngmo May 24 '21 at 21:47
  • @TedLyngmo Sure. Let me edit the post and update it in a while. – kishoredbn May 24 '21 at 21:50
  • 2
    "freezing".. or blocking? Are you sure there's data available to read and you're correctly resetting the read list before every call? Are you using a timeout? `select` will block if none of its file descriptor lists can perform non-blocking ops and there's no timeout. Sounds like a strange architecture to me that you're relying on a failed `select` call for proper app operation. – yano May 24 '21 at 21:55
  • @yano It looks like blocking should be a correct term here. There is not data available to read in the 3rd iteration. I am setting the read list correctly, because in the 1st and 2nd iteration the read works fine, following the same loop, in the same thread. Is `select` supposed to block when no data available? – kishoredbn May 24 '21 at 21:59
  • 2
    yes, `select` will block if it has no work to do. You can give it a timeout (eg, "wait for this long before giving up") if you don't want it to block indefinitely. Read the [`select` man page](https://man7.org/linux/man-pages/man2/select.2.html). `select` will block indefinitely if there's no timeout and no file descriptor ready for ops. If there is a timeout, it will block for approximately that long before returning. The `timeout` section has more details. – yano May 24 '21 at 22:06
  • 1
    Without code, we can only speculate. Shooting in the dark I can only speculate that on read reads the content of two writes which leave the third read with nothing left to read. – doron May 24 '21 at 22:10
  • You may want to re-think your design/architecture, particularly if using TCP. My typical `select` usage has been `while(running){ ... select(.../*some small millisecond timeout*/); }`, and if I want to exit cleanly, I set `running=false` somewhere else. Depending on exactly 2 successful `select` reads doesn't sound particularly robust, especially for TCP which can deliver data to you in any sized chunks. But, if that's how things work, you might as well quit after the 2nd successful `select` read, no point in calling it a 3rd time if there's nothing to read. – yano May 24 '21 at 22:16
  • @doron updated with code snippet – kishoredbn May 24 '21 at 22:31
  • @TedLyngmo Updated with code snippet. – kishoredbn May 24 '21 at 22:31
  • @yano, it makes sense, but the problem it that it is a thrid-party lib code which I am using in my code. I can take a look and play around with the time out logic as for now, I have using `select(nfds, &f_set, 0, 0, 0);` 0 timeout, which should return immidiately. – kishoredbn May 24 '21 at 22:33

3 Answers3

2

You are passing the wrong timeout parameter to your second call to select, which is therefore blocking.

From the documentation:

If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)

If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.

So pass the address of a zeroed timeval struct, rather than 0, and you should be OK.

Paul Sanders
  • 24,133
  • 4
  • 26
  • 48
1

@PaulSanders gave the correct answer, just thought I'd explain it a little more. Your select(nfds, &f_set, 0, 0, 0); call is passing a 0 for the timespec argument. This is not the same thing as a zeroed out timespec. The fifth argument of select is a pointer to a timespec, so when you put 0 there, it's read as a NULL pointer constant, equivalent to a NULL pointer (you can read more about that here). As the man page says, if the timeout is NULL (that is, the address of a valid timespec is not passed in), then select will block indefinitely, and this is the behavior you're seeing. I would change that code to

// perform select
struct timeval_t time_val = { 0 }; // zeros the time values in the struct
if(time_out_valid())
{
  // whatever this does to retrieve the timeout, presumably it's
  // returning a non-zeroed struct
  time_val = GetTimeOut();
}
// the address of `time_val` is _not_ 0/NULL. It is valid.
// If `time_out_valid() is false, then its data fields `tv_sec` and
// `tv_nsec` are still both 0, and `select` will return immediately.
err = select(nfds, &f_set, 0, 0, &time_val);

This is IMHO, but as a practice, I always use NULL for NULL/invalid pointers, and 0 for actual zero values (for ints, for instance). That just makes it clearer to me, at a glance, what data types are being used/expected. However, I've seen plenty of C code that uses 0 for NULL/invalid pointers, so that's hardly a "best practice rule". But, with that in mind, I would change your select call to

err = select(nfds, &f_set, NULL, NULL, &time_val);

I see you're tagged c++. If you're using modern c++, you can probably do

err = select(nfds, &f_set, nullptr, nullptr, &time_val);

as the preferred method, but I have not tested this.

yano
  • 4,827
  • 2
  • 23
  • 35
0

I found the problem with the infinite select blockage. I was not closeing the socket after processing.

kishoredbn
  • 2,007
  • 4
  • 28
  • 47