
Generic Receive Offload (GRO) is a software technique in Linux to aggregate multiple incoming packets belonging to the same stream. The linked article claims that CPU utilization is reduced because, instead of each packet traversing the network stack individually, a single aggregated packet traverses the network stack.

However, if one looks at the source code of GRO, it feels like a network stack in itself. For example, an incoming TCP/IPv4 packet needs to go through:
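(roughly, for a 4.x kernel; the callback names below are taken from the GRO offload code in net/core/dev.c, net/ipv4/af_inet.c and net/ipv4/tcp_offload.c, and the exact chain varies with kernel version and configuration)

    napi_gro_receive()            /* driver hands the frame to GRO            */
      -> dev_gro_receive()        /* generic layer, walks the offload list    */
        -> inet_gro_receive()     /* IPv4: parses the IP header, finds TCP    */
          -> tcp4_gro_receive()   /* validates the TCP checksum               */
            -> tcp_gro_receive()  /* tries to merge into an existing GRO flow */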

Each function performs decapsulation and looks at the respective frame/network/transport headers, just as would be expected from the "regular" network stack.

Assuming the machine does not perform firewall/NAT or other obviously expensive per-packet processing, what is so slow in the "regular" network stack that the "GRO network stack" can accelerate?

– user1202136

1 Answer


Short answer: GRO is done very early in the receive flow, so it basically reduces the number of per-packet operations in the rest of the stack by a factor of roughly (GRO session size / MTU).
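For example, with a 1500-byte MTU and a GRO session that grows to the usual 64 KB limit, 65536 / 1500 ≈ 44, so the layers above GRO see roughly one aggregated SKB for every ~44 packets arriving on the wire (sessions are often flushed earlier, so the real factor is usually smaller).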

More details: The most common GRO function is napi_gro_receive(). It is used 93 times (in kernel 4.14) by almost all networking drivers. By using GRO at the NAPI level, the driver performs the aggregation into a large SKB very early, right in the receive completion handler. This means that all the subsequent functions in the receive stack do much less processing.
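To make that concrete, here is a minimal sketch of where napi_gro_receive() sits in a driver's NAPI poll handler (the example_* names and the descriptor-to-SKB helper are hypothetical driver-specific pieces; napi_gro_receive() and napi_complete_done() are the actual kernel APIs):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical driver helper: turns a completed RX descriptor into an skb. */
    static struct sk_buff *example_build_rx_skb(struct napi_struct *napi);

    static int example_napi_poll(struct napi_struct *napi, int budget)
    {
            struct sk_buff *skb;
            int done = 0;

            while (done < budget && (skb = example_build_rx_skb(napi))) {
                    /* Hand the frame to GRO instead of netif_receive_skb():
                     * consecutive TCP segments of the same flow are merged into
                     * one large skb here, before the rest of the stack runs. */
                    napi_gro_receive(napi, skb);
                    done++;
            }

            /* Out of packets before the budget ran out: stop polling and
             * (in a real driver) re-enable RX interrupts. */
            if (done < budget)
                    napi_complete_done(napi, done);

            return done;
    }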

Here is a nice visual representation of the RX flow for a Mellanox ConnectX-4Lx NIC (sorry, this is what I have access to):

[flame graph of the kernel RX call stack captured on a ConnectX-4Lx]

As you can see, GRO aggregation sits at the very bottom of the call stack, and you can also see how much work is done afterwards. Imagine how much overhead you would have if each of these functions operated on a single MTU-sized packet instead of one aggregated SKB.

Hope this helps.

– Tgilgul
  • Hi! Is the advanced offloading functionality of ConnectX-4Lx Ethernet the same as in a ConnectX-4-based VPI card with its port in Ethernet mode and the same Ethernet port speed? – osgx Nov 25 '17 at 18:32
  • ConnectX-4Lx is a slightly newer architecture, so although they are pretty similar, there are some improvements in ConnectX-4Lx. One difference very relevant to this post is HW LRO. Both architectures support this feature, which provides the same functionality as GRO but entirely at the HW level, meaning zero overhead for the SW and all of the benefit. The difference is that ConnectX-4 needs to allocate much (much) more memory to do so compared to ConnectX-4Lx, since it is missing a key feature (stride-rq). – Tgilgul Nov 25 '17 at 21:35
  • And in RDMA over Ethernet mode, should both the X-4 and X-4Lx give the same performance, and do both require special RDMA-compatible switches? – osgx Nov 25 '17 at 23:54
  • Yes, but ConnectX-4 is 100GbE dual port while ConnectX-4Lx max speed is 50GbE single port or 25GbE dual port. – Tgilgul Nov 26 '17 at 01:16
  • How did you generate this plot? I'm aware of how to generate a flame graph, but not how to profile the network stack directly. – Jason R Mar 13 '19 at 14:17
  • @JasonR the best way is to run a single-stream benchmark at very low rates and sample the CPU handling the traffic. Make sure the kernel is compiled with symbols, of course. – Tgilgul Mar 13 '19 at 18:13
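For reference, a rough sketch of that profiling workflow (the CPU number and file names are placeholders; it assumes perf and Brendan Gregg's FlameGraph scripts are installed):

    # Sample kernel call stacks on the CPU handling the RX traffic
    # (CPU 2 is arbitrary; pin the NIC's RX queue/IRQ to a known CPU first).
    perf record -g -C 2 -- sleep 30

    # Fold the samples and render a flame graph.
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > rx-stack.svg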