Generic Receive Offload (GRO) is a software technique in Linux to aggregate multiple incoming packets belonging to the same stream. The linked article claims that CPU utilization is reduced because, instead of each packet traversing the network stack individually, a single aggregated packet traverses the network stack.
However, if one looks at the source code of GRO, this feels like a network stack in itself. For example, an incoming TCP/IPv4 packet needs to go through:
Each function performs decapsulation and looks at respective frame/network/transport headers as would be expected from the "regular" network stack.
Assuming the machine does not perform firewall/NAT or other obviously expensive per-packet processing, what is so slow in the "regular" network stack that the "GRO network stack" can accelerate?