
When an application has a huge amount of data (400M) to write to a non-blocking socket, write() returns EWOULDBLOCK or EAGAIN when the send buffer becomes full.

When the socket is (e)polled, I sometimes see a write-ready notification happening when there's 7M space in the send buffer, sometimes 20M and at other times 1M. The variation in the delay between write-ready callbacks is huge: from milliseconds to tens of seconds!

So my question is when exactly does the kernel trigger a write-ready for a socket? What affects triggering of write-ready? Obviously it's not triggered as soon as 1B is written to the wire.

Any help would be appreciated!

I'm using:

Ubuntu 12.04 LTS

Kernel 3.8.0-39-generic

Arch: x86_64

EDIT: Sockets in this context are TCP/IP sockets.

themoondothshine
  • At least 1 byte of send buffer available. That's exactly the definition. Of course, data is sent in datagrams, not byte-wise, but still. – Damon May 08 '14 at 16:39
  • @Damon I thought so too. But it doesn't seem that way. Can you point me to some documentation? – themoondothshine May 09 '14 at 12:34
  • `poll` returns when an "event" occurs (and `epoll` is really just the same). The standard [states](http://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html) that "normal data may be written" as condition for the `POLLOUT` event without requiring a minimum amount (so, _any_ amount, including 1 byte could be written). Of course in reality, only complete datagrams can be sent (and thus only complete datagrams can be removed), so that's units of around 1kB (more or less, depending on MTU). – Damon May 09 '14 at 15:06
  • Also, there are measures to avoid "IRQ storms" or what the correct term is for them (too many interrupts). If you send megabytes of data, the kernel will tell the network card to pull 20 or 50 packets via DMA to send them, and the network card will generate one "done!" interrupt after that. So you may as well only unblock after a few hundred kB, too. It will try hard to avoid having a hundred thousand interrupts per second, since that would kill performance. – Damon May 09 '14 at 15:07
  • Of course. Maybe I should rephrase the question. For some reason, throughput seems to be affected... TCP_NODELAY is set, but there seems to be a direct correspondence to arrival of write-ready and data being written to the wire. Throughput for smaller amounts of data seems to be just fine. – themoondothshine May 11 '14 at 15:08
  • See my answer [here](https://stackoverflow.com/a/55808163/3002584). – OfirD Apr 23 '19 at 14:01

1 Answer


So my question is when exactly does the kernel trigger a write-ready for a socket?

tl;dr: As long as the socket has enough buffer space, writes succeed, and in the default level-triggered mode epoll_wait keeps returning events to say so. If the socket runs out of space, blocking writers sleep. When data is acknowledged and space is freed, the kernel wakes those processes (or delivers epoll events saying the socket is writable), but only if the socket had run out of space. As before, as long as the socket stays writable, level-triggered events keep pouring in, even if no new notifications come from TCP.
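The level-triggered behaviour described above can be sketched from userspace. This is a minimal illustration, not kernel code: it assumes Linux, uses an AF_UNIX socketpair as a stand-in for a TCP connection, and the helper name count_writable_reports is made up for this example. It registers an idle, writable socket with epoll and counts how many consecutive zero-timeout epoll_wait calls report EPOLLOUT.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: count how many of `rounds` consecutive zero-timeout
 * epoll_wait() calls report EPOLLOUT for an idle socket that has free
 * buffer space. Pass extra = 0 for level-triggered, EPOLLET for
 * edge-triggered registration. */
static int count_writable_reports(unsigned int extra, int rounds)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv); /* stand-in for TCP */
    assert(rc == 0);

    int ep = epoll_create1(0);
    assert(ep >= 0);

    struct epoll_event ev = { .events = EPOLLOUT | extra };
    ev.data.fd = sv[0];
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);
    (void)rc;

    int reports = 0;
    for (int i = 0; i < rounds; i++) {
        struct epoll_event out;
        /* Nothing is written or read between calls. */
        int n = epoll_wait(ep, &out, 1, 0);
        if (n == 1 && (out.events & EPOLLOUT))
            reports++;
    }

    close(ep);
    close(sv[0]);
    close(sv[1]);
    return reports;
}
```

Calling count_writable_reports(0, 3) should report the socket as writable on every round, while count_writable_reports(EPOLLET, 3) should report it only once, because no new event arrives after the initial one.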

The function that performs the actual notification is sk_write_space. This is a member of struct sock and for TCP the relevant implementation is sk_stream_write_space in stream.c.

    ...
    if (skwq_has_sleeper(wq))
        wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
                    EPOLLWRNORM | EPOLLWRBAND);
    if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
        sock_wake_async(wq, SOCK_WAKE_SPACE, POLL_OUT);
    ...

This function wakes up any callers that might be waiting for memory. (Compare this with sock_def_write_space.)
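The effect of that wake-up can be observed from userspace. The following is a rough sketch under stated assumptions: Linux, an AF_UNIX socketpair instead of TCP (so the kernel path exercised is the Unix-socket equivalent, not sk_stream_write_space), and helper names invented for this example. It checks writability before filling the send buffer, once write() has failed with EAGAIN, and again after the peer has drained everything.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

static void set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    assert(flags >= 0);
    int rc = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    assert(rc == 0);
    (void)rc;
}

/* Zero-timeout poll: 1 if fd is currently reported writable, else 0. */
static int is_writable(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT };
    return poll(&p, 1, 0) == 1 && (p.revents & POLLOUT) != 0;
}

/* Write until the kernel refuses more data; returns bytes queued. */
static long fill_send_buffer(int fd)
{
    char chunk[4096] = {0};
    long total = 0;
    for (;;) {
        ssize_t n = write(fd, chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;
        assert(n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
        return total;
    }
}

/* Read until nothing is left; returns bytes drained. */
static long drain(int fd)
{
    char chunk[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;
        return total; /* EAGAIN (or EOF): queue is empty */
    }
}

/* Bit 0: writable before filling. Bit 1: writable while full.
 * Bit 2: writable after the peer drained the data. */
static int writability_timeline(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;
    set_nonblocking(sv[0]);
    set_nonblocking(sv[1]);

    int mask = 0;
    if (is_writable(sv[0])) mask |= 1;
    long sent = fill_send_buffer(sv[0]);
    if (is_writable(sv[0])) mask |= 2;
    long got = drain(sv[1]);
    assert(got == sent);
    (void)got; (void)sent;
    if (is_writable(sv[0])) mask |= 4;

    close(sv[0]);
    close(sv[1]);
    return mask;
}
```

On Linux I would expect writability_timeline() to return 5: writable at the start, not writable once write() has returned EAGAIN, and writable again after the drain. With a real TCP connection the middle step is harder to observe reliably, since ACKs and buffer autotuning can free space asynchronously.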

But when is sk_write_space called? There are a few call sites, but the most prominent is tcp_new_space, which is called by tcp_check_space, which in turn is called by tcp_data_snd_check from a bunch of places on the receive path. tcp_new_space has a descriptive comment:

 When incoming ACK allowed to free some skb from write_queue,
 we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket
 on the exit from tcp input handler.

tcp_check_space is interesting:

    if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
        sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
        /* pairs with tcp_poll() */
        smp_mb();
        if (sk->sk_socket &&
            test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
            tcp_new_space(sk);
            ...
        }
    }

Some relevant bits here:

  1. SOCK_QUEUE_SHRUNK is defined as "write queue has been shrunk recently" and is set on the transmit path. tcp_check_space checks and clears it.
  2. SOCK_NOSPACE is set on the transmit path when we run out of buffer space.

The conclusion from all this is that tcp_check_space avoids sending events unless the socket was out of space.
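The two-flag logic can be illustrated with a toy userspace model. This is only a sketch: struct toy_sock and the toy_* functions are hypothetical names that mirror the kernel flags, none of it is kernel API, and the real code applies more conditions (for example, how much space must be free) than this model does.

```c
#include <stdbool.h>

/* Hypothetical model of a socket's send side. */
struct toy_sock {
    int queued;        /* bytes sitting in the write queue */
    int sndbuf;        /* capacity of the send buffer */
    bool queue_shrunk; /* stands in for SOCK_QUEUE_SHRUNK */
    bool nospace;      /* stands in for SOCK_NOSPACE */
    int wakeups;       /* how many times a writer would be woken */
};

/* Writer side: queue what fits; flag "no space" if it came up short. */
static int toy_send(struct toy_sock *sk, int len)
{
    int room = sk->sndbuf - sk->queued;
    int n = len < room ? len : room;
    sk->queued += n;
    if (n < len)
        sk->nospace = true;
    return n;
}

/* ACK processing: free acknowledged bytes and remember the queue shrank. */
static void toy_ack(struct toy_sock *sk, int len)
{
    if (len > sk->queued)
        len = sk->queued;
    if (len > 0) {
        sk->queued -= len;
        sk->queue_shrunk = true;
    }
}

/* Modelled on the tcp_check_space excerpt: wake only if the queue shrank
 * AND a writer had previously run out of space. */
static void toy_check_space(struct toy_sock *sk)
{
    if (sk->queue_shrunk) {
        sk->queue_shrunk = false;
        if (sk->nospace) {
            sk->nospace = false; /* simplification of the real wake-up path */
            sk->wakeups++;
        }
    }
}

/* Send `len` bytes into a 100-byte buffer, ACK 10 bytes, and report how
 * many wake-ups were generated. */
static int toy_wakeups_after_ack(int len)
{
    struct toy_sock sk = { .queued = 0, .sndbuf = 100 };
    toy_send(&sk, len);
    toy_ack(&sk, 10);
    toy_check_space(&sk);
    return sk.wakeups;
}
```

In this model, toy_wakeups_after_ack(50) yields no wake-up because the writer never ran out of room, while toy_wakeups_after_ack(150) yields one because the send came up short before the ACK arrived.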

What about tcp_data_snd_check? During the steady state the most relevant calls are in tcp_rcv_established:

  1. The fast-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5575

  2. The almost-fast path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5618

  3. The slow-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5658

All of these signal that data was successfully ACKed.


There are other callers of sk_write_space in TCP. do_tcp_sendpages and tcp_sendmsg_locked call it on error paths to make sure callers are woken up. do_tcp_setsockopt calls it when setting TCP_NOTSENT_LOWAT.

cnicutar
  • But this isn't true. If you call `select()` and the socket send buffer has space, it will return immediately, and continue to do so even if you never write anything. You need to distinguish here between level triggering and edge triggering, and the various `select()/poll()/epoll()` functions and how they are triggered. – user207421 Feb 10 '19 at 02:35
  • @user207421 I focused on the interesting part. I added a note to explain the socket remains writable if you don't run out of space. – cnicutar Feb 10 '19 at 08:32
  • But you didn't fix the underlying problem. You are still claiming that it all happens 'only if the socket had run out of space'. This is exactly the point at issue. – user207421 Feb 10 '19 at 08:56