I'm doing a blocking connect() call on a client UNIX socket. Below is an example of the code:

    // Create socket.

    fds[i] = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fds[i] == -1)
        {
        result = -1;
        goto done;
        }
    printf("generate_load thread, fds[%d]: %d\n", i, fds[i]);
//      int flags = fcntl(fds[i], F_GETFL);
//      fcntl(fds[i], F_SETFL, flags | O_NONBLOCK);

    // If we have a timeout value we're only going to use that as
    // a connect timeout.  From looking at some source code, it
    // appears the only way to timeout (correctly) a unix domain
    // socket connect() call is to set the send timeout.

    struct timeval existing_timeout;
    if (timeout != 0)
        {
        socklen_t len = sizeof(existing_timeout);
        getsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &existing_timeout,
                &len);

        struct timeval tv;
        tv.tv_sec = timeout / 1000000;
        tv.tv_usec = timeout % 1000000;
        setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
        }

    // Set socket name.

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, socket_name, sizeof(addr.sun_path) - 1);

    // @ indicates abstract name and abstract names begin with a NULL
    // byte.

    if (socket_name[0] == '@')
        addr.sun_path[0] = '\0';

    // Connect.

    result = connect(fds[i], (struct sockaddr*) &addr, sizeof(addr));
    if (result == -1)
        {
        printf("generate_load thread, failed connecting: %d\n", errno);
        if (errno == EAGAIN)
            errno = ETIMEDOUT;
        goto done;
        }

    printf("generate_load thread, connected fds[%d]: %d\n", i, fds[i]);

    // If we set a timeout then set it back to what it was.

    if (timeout != 0)
        {
        setsockopt(fds[i], SOL_SOCKET, SO_SNDTIMEO, &existing_timeout,
                sizeof(existing_timeout));
        }

This code all works fine until the accepting side, which for now is in the same process, fails due to the file descriptor limit. The accept() call fails with errno = 24 (EMFILE). I'm fine with getting the error, but why is the client not seeing an error? Instead the client is blocked and never returns. As you can see, I commented out the lines that put the socket in non-blocking mode. I believe in non-blocking mode I encounter some EAGAIN errors.

Also, when I hit the file descriptor limit, the accepting side appears to be constantly attempting to accept that socket. I'm using select() and waiting for the listening socket to become readable; when it is, I call accept(). I can understand getting the first EMFILE error, but I would have expected that error to be transmitted back to the connect() call. That would have caused the client code to break out of its loop, so no more connect() calls would be made, and the accepting side would then block in select().

Below is a snippet of the listening side. The code below is within a while(1) loop which first calls select():

    if (FD_ISSET(ti->listen_fd, &read_set) != 0)
        {
        printf("select thread, accepting socket\n");
        int sock = accept(ti->listen_fd, NULL, NULL);
        printf("select thread, accepted socket\n");
        if (sock == -1)
            {
            printf("select thread, failed accepting socket: %d\n", errno);
            if (error_threshold_met(&eti) == 0)
                {
                log_event(LOG_LEVEL_ERROR, "select thread, accept() "
                        "failed: %s", get_error_string(errno, error_string,
                        sizeof(error_string)));
                }
            }

The code appears to work fine until I hit the 1024 file descriptor limit. Any ideas why it's behaving this way? Is this the expected behavior, and am I just misunderstanding how it should work?

Thanks, Nick

  • While the server ran out of file handles, the OS may have queued the client connection attempt. The client will only disconnect when it is denied the connection. Absence of a reply will just block the client. – alvits Aug 03 '16 at 23:45
  • @alvits: Thanks. So there's no way for me to solve this problem? The connect() will be blocked indefinitely? Is the only solution to timeout the connect() call? –  Aug 03 '16 at 23:55
  • @nickdu - I see that you are already sending a timeout value, you should set socket to `O_NONBLOCK`. – alvits Aug 04 '16 at 00:05
  • @alvits [SO_SNDTIMEO](http://linux.die.net/man/7/socket) is a write timeout. It has nothing to do with `connect()`. – user207421 Aug 04 '16 at 00:09
  • @alvits: yes, if you look at the code I do have some lines in there to set the connecting side of the socket to non-blocking mode. I commented those out temporarily because it didn't seem to be timing out as I expected. It was returning an EAGAIN, but way sooner than the timeout value I gave it. –  Aug 04 '16 at 00:12
  • @EJP: I was looking over some source code and it appears the SO_SNDTIMEO is also used to timeout the connect. –  Aug 04 '16 at 00:14
  • @nickdu I was looking at the *man* page which says not. – user207421 Aug 04 '16 at 00:15
  • @EJP: check out my answer to http://stackoverflow.com/questions/35801679/cant-seem-to-get-a-timeout-working-when-connecting-to-a-socket. I include some snippets of source I found. –  Aug 04 '16 at 00:19
  • @nickdu - `EAGAIN` is described as _No more free local ports or insufficient entries in the routing cache._. This is better than a `TIMEOUT` because in reality, your client really can't connect to the server due to resource exhaustion. – alvits Aug 04 '16 at 00:22
  • It isn't germane to this question. – user207421 Aug 04 '16 at 00:22
  • @nickdu - as to my comment about adding `O_NONBLOCK`, it's because you commented it out that caused the client connection to block. – alvits Aug 04 '16 at 00:24
  • @alvits: yes, I'm realizing that commenting out the call is what's got the connecting side blocked, but I was assuming the failure on the accepting side would be reported to the connect() call, so this is why I kept the code commented out. I'm now learning more about how connect()/accept() works. –  Aug 04 '16 at 00:29
  • The `EAGAIN` is curious. `connect()` returns error `EINPROGRESS`, not `EAGAIN`, if the socket is non-blocking and the connection wasn't completed immediately. – user207421 Aug 04 '16 at 01:29
  • connect was returning EAGAIN in non-blocking mode on my fedora 23 OS. However, once I get the timeout working by setting the send timeout on the socket, I removed the code which sets the socket in non-blocking mode and the code was timing out as expected. It was still returning EAGAIN, but once I got the connect timeout working, I don't really care what the error is. –  Aug 04 '16 at 01:33

1 Answer

connect() and accept() are not interlocked. You can call connect() and have it return without the server ever calling accept() at all. The server-side part of the connection handshake happens in the kernel, independently of accept(); this holds for UNIX domain sockets as well as for TCP. All that accept() does is take an incoming connection off a queue and create a socket around it, blocking while the queue is empty. Here the socket-creation step is failing due to FD exhaustion, but the connection itself is already established.
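A minimal sketch of that claim for a UNIX domain socket: the listener below calls listen() but never accept(), and the client's connect() is still expected to succeed because the kernel queues the connection. The function name `connect_without_accept` and the socket path passed to it are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns the result of connect() against a listener that never calls
 * accept(): 0 if the connect succeeded, -1 on any failure. */
int connect_without_accept(const char *path)
{
    assert(path != NULL);

    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);  /* remove a stale socket file, if any */

    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener == -1)
        return -1;
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(listener, 8) == -1) {
        close(listener);
        return -1;
    }

    /* Note: accept() is deliberately never called on the listener. */
    int result = -1;
    int client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (client != -1) {
        result = connect(client, (struct sockaddr *)&addr, sizeof(addr));
        close(client);
    }

    close(listener);
    unlink(path);
    return result;
}
```

Calling `connect_without_accept("/tmp/demo_no_accept.sock")` should return 0 as long as the listen queue has room; only once that queue fills up does a blocking connect() start to wait.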

user207421
  • Thanks, but I'm not sure that's true. If it was, why is my connect() call blocked when I hit this condition? Also, I set the backlog on the listen() call to zero. –  Aug 03 '16 at 23:53
  • It's true all right. The listen backlog queue fills up if you aren't accepting, which eventually causes new connections to block or fail, depending on the server platform. – user207421 Aug 03 '16 at 23:54
  • I don't believe the kernel acknowledges the connect (e.g., sends a SYN packet) until accept() returns. – Max Aug 03 '16 at 23:55
  • @Max Believe whatever you like, but it does. Try it. Call connect without calling accept. – user207421 Aug 03 '16 at 23:56
  • @EJP: So you're saying that since my listen backlog is zero, eg. there is no queue, the OS is blocking the connect() calls until either accept() succeeds or the connect() call times out? –  Aug 03 '16 at 23:59
  • Actually no. The kernel will adjust the backlog up or down to suit itself, and it will never operate with a zero backlog. The minimum is five or fifty depending on your platform. But the behaviour you describe is what happens when the backlog queue fills up. – user207421 Aug 04 '16 at 00:02
  • So one last question before I'm ready to mark this as the answer. It seems the accepting side is stuck in an infinite loop if the client doesn't timeout the call. Is there a way for the accepting side to remove the socket from the OS queue such that it will get out of this infinite loop? –  Aug 04 '16 at 00:26
  • The real problem is why did you run out of file descriptors? I would say you have an FD leak somewhere. Fix that. The client connect will timeout anyway, it isn't on an infinite timeout: about a minute. – user207421 Aug 04 '16 at 00:29
  • @EJP: The file descriptor exhaustion is on purpose. I'm testing some failure cases. I was hoping the failure was going to get reflected at the connecting side, but it wasn't (if you don't use a timeout). That's the main question here. –  Aug 04 '16 at 00:32
  • OK, doesn't change my answer. I would ask whether the test is still valid, given that the actual behaviour is not as you expected. – user207421 Aug 04 '16 at 00:33
  • @EJP: So are you saying that if the connecting side does not set a timeout then there is no way around getting stuck in an infinite loop in this condition? –  Aug 04 '16 at 00:35
  • I just said exactly the opposite. I said the client connect *will* timeout. – user207421 Aug 04 '16 at 00:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/119059/discussion-between-nickdu-and-ejp). –  Aug 04 '16 at 00:37
  • Enough is enough. I've answered this question. You're now asking another one. – user207421 Aug 04 '16 at 00:42
  • Just want to describe what fixed my problem. I had a bug in the code where I was using a different variable and thus was not setting the send timeout on the socket I was attempting to set it on. Once I fixed this and removed the call to set the socket in non-blocking mode, I was able to get the connect() to timeout as I desired. –  Aug 04 '16 at 01:29
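The fix described in the last comment can be sketched as follows, assuming Linux behaviour for AF_UNIX stream sockets (the asker observed it on Fedora 23; other platforms may differ). The listen queue is first filled with non-blocking connects that nobody accepts, then a blocking connect() is attempted with SO_SNDTIMEO set on the same descriptor that is passed to connect(). The function name `connect_timeout_errno`, the `MAX_FILLERS` cap, and the socket path are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/un.h>
#include <unistd.h>

#define MAX_FILLERS 64  /* arbitrary cap for this sketch */

/* Returns the errno of a blocking connect() that gave up after
 * timeout_usec, 0 if it unexpectedly connected, -1 on setup failure. */
int connect_timeout_errno(const char *path, long timeout_usec)
{
    assert(path != NULL);

    struct sockaddr_un addr;
    struct timeval tv;
    int fillers[MAX_FILLERS];
    int nfill = 0, full = 0, outcome = -1, client = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);

    int listener = socket(AF_UNIX, SOCK_STREAM, 0);
    if (listener == -1)
        return -1;
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(listener, 1) == -1)
        goto done;

    /* Fill the listen queue with non-blocking connects; accept() is
     * never called, so the queue eventually reports EAGAIN. */
    while (!full && nfill < MAX_FILLERS) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
            goto done;
        fillers[nfill++] = fd;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
            if (errno != EAGAIN)
                goto done;
            full = 1;
        }
    }
    if (!full)
        goto done;

    /* Blocking socket: the timeout must be set on the SAME descriptor
     * that is then passed to connect(). */
    client = socket(AF_UNIX, SOCK_STREAM, 0);
    if (client == -1)
        goto done;
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;
    if (setsockopt(client, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == -1)
        goto done;
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) == -1)
        outcome = errno;  /* EAGAIN after the timeout on Linux */
    else
        outcome = 0;

done:
    if (client != -1)
        close(client);
    while (nfill > 0)
        close(fillers[--nfill]);
    close(listener);
    unlink(path);
    return outcome;
}
```

On Linux, `connect_timeout_errno("/tmp/demo_sndtimeo.sock", 200000)` is expected to return EAGAIN after roughly 200 ms, which the question's code then maps to ETIMEDOUT. The filler loop also shows why the non-blocking connect() reported EAGAIN rather than EINPROGRESS: with a full queue and no timeout to wait out, the call fails immediately.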