
I am running iperf measurements between two servers connected through a 10 Gbit link. I am trying to correlate the maximum window size I observe with the system configuration parameters.

In particular, I have observed that the maximum window size is 3 MiB. However, I cannot find the corresponding values in the system files.

By running `sysctl -a` I get the following values:

net.ipv4.tcp_rmem = 4096        87380   6291456
net.core.rmem_max = 212992

The first value (the maximum of tcp_rmem) tells us that the maximum receive buffer size is 6 MiB. However, TCP allocates twice the requested size and uses the extra space for overhead, so the effective maximum receiver window should be 3 MiB, exactly as I have measured. From man tcp:

Note that TCP actually allocates twice the size of the buffer requested in the setsockopt(2) call, and so a succeeding getsockopt(2) call will not return the same size of buffer as requested in the setsockopt(2) call. TCP uses the extra space for administrative purposes and internal kernel structures, and the /proc file values reflect the larger sizes compared to the actual TCP windows.
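
To make that doubling concrete, here is a minimal userspace sketch (my own illustration, not taken from the man page): it requests a receive buffer below rmem_max and reads back what the kernel actually reserved, which comes out at roughly twice the requested value.

/* Request a receive buffer and read back what the kernel reserved. */
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* error handling omitted */
    int requested = 128 * 1024;                 /* 128 KiB, below rmem_max = 212992 */
    int actual = 0;
    socklen_t len = sizeof(actual);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);

    /* On Linux this prints roughly 2 * requested (here: 262144). */
    printf("requested %d bytes, kernel reports %d bytes\n", requested, actual);
    return 0;
}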

However, the second value, net.core.rmem_max, states that the maximum receiver window size cannot be more than 208 KiB. And this is supposed to be the hard limit, according to man tcp:

tcp_rmem max: the maximum size of the receive buffer used by each TCP socket. This value does not override the global net.core.rmem_max. This is not used to limit the size of the receive buffer declared using SO_RCVBUF on a socket.

So how come I observe a maximum window size larger than the one specified in net.core.rmem_max?

NB: I have also calculated the bandwidth-delay product: window_size = bandwidth × RTT = 10 Gbit/s × 2 ms ≈ 2.5 MB (about 2.4 MiB), which is in the same ballpark as the 3 MiB I see, thus corroborating my traffic capture.
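
For reference, the back-of-the-envelope arithmetic as a tiny standalone program (values as above, purely illustrative):

#include <stdio.h>

int main(void)
{
    double bandwidth_bps = 10e9;  /* 10 Gbit/s link */
    double rtt_s = 0.002;         /* 2 ms round-trip time */

    /* Bandwidth-delay product in bytes. */
    double bdp = bandwidth_bps * rtt_s / 8.0;
    printf("BDP = %.0f bytes = %.2f MiB\n", bdp, bdp / (1024.0 * 1024.0));
    /* Prints: BDP = 2500000 bytes = 2.38 MiB */
    return 0;
}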


2 Answers


A quick search turned up:

https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/net/ipv4/tcp_output.c

in void tcp_select_initial_window()

if (wscale_ok) {
    /* Set window scaling on max possible window
     * See RFC1323 for an explanation of the limit to 14
     */
    space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
    space = min_t(u32, space, *window_clamp);
    while (space > 65535 && (*rcv_wscale) < 14) {
        space >>= 1;
        (*rcv_wscale)++;
    }
}

max_t takes the larger of its two arguments, so the bigger value takes precedence here.
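
To see what that loop produces with the values from the question (tcp_rmem max = 6291456, rmem_max = 212992), here is a small userspace re-run of it; this is a simplified sketch that ignores the window_clamp term:

#include <stdio.h>

int main(void)
{
    /* space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max) */
    unsigned int space = 6291456 > 212992 ? 6291456 : 212992;
    unsigned int rcv_wscale = 0;

    while (space > 65535 && rcv_wscale < 14) {
        space >>= 1;
        rcv_wscale++;
    }

    /* 6291456 >> 7 = 49152 <= 65535, so rcv_wscale ends up as 7; the advertised
     * window can then grow up to 65535 << 7 (about 8 MiB), well past rmem_max. */
    printf("rcv_wscale = %u\n", rcv_wscale);
    return 0;
}

So with these sysctls the window scale factor is driven by the larger tcp_rmem value, not by rmem_max.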

The only other reference to sysctl_rmem_max is in net/core/sock.c, where it is used to limit the argument to SO_RCVBUF.

All other tcp code refers to sysctl_tcp_rmem only.

So without looking deeper into the code you can conclude that a bigger net.ipv4.tcp_rmem will override net.core.rmem_max in all cases except when setting SO_RCVBUF (whose check can be bypassed using SO_RCVBUFFORCE).
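
For completeness, here is a userspace sketch (my illustration, not from the kernel sources) of that SO_RCVBUF clamp and the Linux-specific SO_RCVBUFFORCE escape hatch, which requires CAP_NET_ADMIN:

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* error handling omitted */
    int want = 6 * 1024 * 1024;                 /* 6 MiB, far above rmem_max = 212992 */
    int got = 0;
    socklen_t len = sizeof(got);

    /* SO_RCVBUF is silently capped at net.core.rmem_max (then doubled). */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("SO_RCVBUF:      asked for %d, got %d\n", want, got);

    /* SO_RCVBUFFORCE ignores rmem_max, but needs CAP_NET_ADMIN. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, &want, sizeof(want)) == 0) {
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
        printf("SO_RCVBUFFORCE: asked for %d, got %d\n", want, got);
    } else {
        perror("SO_RCVBUFFORCE");
    }
    return 0;
}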

mvds
  • `... you can conclude that a bigger net.ipv4.tcp_rmem will override net.core.rmem_max in all cases except when setting SO_RCVBUF` - it's not clear to me yet whether that's happening in all cases: In the function you posted, `max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max)` determines `rcv_wscale`, which is then used to set the window sizes. So the maximum of the two values should be used throughout. However, I cannot experimentally confirm that: Setting `net.core.rmem_max` very large does not increase the window size beyond `tcp_rmem`'s max. – nh2 Feb 16 '16 at 17:38
  • 1
    I did not say the converse is true(!), I'd rather expect that a large `net.core.rmem_max` doesn't have much impact, as you have observed, because all other code only references `net.ipv4.tcp_rmem`; the reference to `net.core.rmem_max` is done *only once* in a routine called `tcp_select_initial_window`. But I haven't looked into all internals to see what's done with `rcv_wscale` after selecting the initial window. All I can conclude from a superficial code scan is that `net.ipv4.tcp_rmem` will be the effective value in all cases if it's bigger than `net.core.rmem_max`. – mvds Feb 16 '16 at 17:58
  • You are right, many places such as [this](https://github.com/torvalds/linux/blob/34229b277480f46c1e9a19f027f30b074512e68b/net/ipv4/tcp_input.c#L455) and [here](https://github.com/torvalds/linux/blob/34229b277480f46c1e9a19f027f30b074512e68b/net/ipv4/tcp_input.c#L607) refer directly to `sysctl_tcp_rmem[2]` instead of `rcv_wscale`. Since it's not transparent to either of us whether `rcv_wscale` is used anywhere else, I've just sent an email to `linux-man@vger.kernel.org` and asked for clarification of the man page. (Continuing in next comment; in any case, your answer definitely hits my bounty, thx!) – nh2 Feb 16 '16 at 18:25
  • Ah darn, I can only award the bounty in 24 hours; if I don't within the next 3 days, please ping me. Unfortunately I cannot link to the `linux-man@vger.kernel.org` mail because its archives generate the contents too slowly. – nh2 Feb 16 '16 at 18:27
  • 2
    Back to the topic, it may be that `"This value does not override the global net.core.rmem_max."` from the man page actually is to be interpreted as `"This value does not override the global net.core.rmem_max to the extent that it does not define the upper users can set with SO_RCVBUF. It does, however, override net.core.rmem_max if SO_RCVBUF is not used."`. This would be consistent with our findings, but if it is true, I find the wording unfortunate (thus my email to `linux-man`). – nh2 Feb 16 '16 at 18:28
  • 1
    Using your explanation, I also found this problem in `iperf3` https://github.com/esnet/iperf/issues/356 that puzzled me (`iperf3 -w` seemed to do nothing): https://github.com/esnet/iperf/issues/356 – nh2 Feb 16 '16 at 18:46
  • For completeness, I also wondered whether one should set `net.ipv4.tcp_mem` (which limits the total system-wide amount of memory to use for TCP, counted in pages). I've concluded it not to be necessary because the kernel sets that to a high default of ~9% of system memory, see [here](https://github.com/torvalds/linux/blob/a1d21081a60dfb7fddf4a38b66d9cef603b317a9/net/ipv4/tcp.c#L4116). – nh2 Sep 19 '20 at 01:53
  • @nh2 is it 9% or 1/128 (so 0.9%)? – Anirudh Goel Dec 03 '22 at 14:02
  • @AnirudhGoel I meant 9%, because that is literally what the comment in the code I linked says. It roughly agrees with what I observe from my laptop, where `sysctl net.ipv4.tcp_mem`'s last value is `1146462` (measured in 4KiB pages), and where I have 48 GiB RAM. – nh2 Dec 04 '22 at 15:11

net.ipv4.tcp_rmem takes precedence over net.core.rmem_max, according to https://serverfault.com/questions/734920/difference-between-net-core-rmem-max-and-net-ipv4-tcp-rmem:

It seems that the tcp-setting will take precedence over the common max setting


But I agree with what you say, this seems to conflict with what's written in man tcp, and I can reproduce your findings. Maybe the documentation is wrong? Please find out and comment!

nh2
  • I've started a bounty to find out whether `man tcp` is really wrong. – nh2 Feb 16 '16 at 16:36
  • After @mvds's great reply, I've reported the apparent man page bug here: https://bugzilla.kernel.org/show_bug.cgi?id=209327 – nh2 Sep 19 '20 at 01:18