
I have developed an XDP program that filters packets based on some specific rules and then either drops them (XDP_DROP) or redirects them (xdp_redirect_map) to another interface. This program was able to process a synthetic load of ~11Mpps (all my traffic generator is capable of) on just four CPU cores.

Now I've changed that program to use XDP_TX to send the packets back out on the interface they were received on, instead of redirecting them to another interface. Unfortunately, this simple change caused a big drop in throughput: now it barely handles ~4Mpps.

I don't understand what could be causing this or how to debug it further, which is why I'm asking here.

My minimal test setup to reproduce the issue:

  • Two machines with Intel x520 SFP+ NICs directly connected to each other. Both NICs are configured to have as many "combined" queues as the machine has CPU cores.
  • Machine 1 runs pktgen using a sample application from the linux sources: ./pktgen_sample05_flow_per_thread.sh -i ens3 -s 64 -d 1.2.3.4 -t 4 -c 0 -v -m MACHINE2_MAC (4 threads, because this was the config that resulted in the highest generated Mpps even though the machine has way more than 4 cores)
  • Machine 2 runs a simple program that drops (or reflects) all packets and counts the pps. In that program, I've replaced the XDP_DROP return code with XDP_TX. (Whether I swap the src/dest MAC addresses before reflecting the packet never made a difference in throughput, so I'm leaving that out here.)

When running the program with XDP_DROP, 4 cores on Machine 2 are slightly loaded with ksoftirqd threads while dropping around ~11Mpps. That only 4 cores are loaded makes sense, given that pktgen sends out 4 different flows that fill only 4 RX queues because of how the hashing (RSS) in the NIC works.

But when running the program with XDP_TX, one of the cores is ~100% busy with ksoftirqd and only ~4Mpps are processed. I'm not sure why that happens.

Do you have an idea what might be causing this throughput drop and CPU usage increase?

Edit: Here are some more details about the configuration of Machine 2:

# ethtool -g ens2f0
Ring parameters for ens2f0:
Pre-set maximums:
RX:             4096
RX Mini:        n/a
RX Jumbo:       n/a
TX:             4096
Current hardware settings:
RX:             512   # changing rx/tx to 4096 didn't help
RX Mini:        n/a
RX Jumbo:       n/a
TX:             512

# ethtool -l ens2f0
Channel parameters for ens2f0:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          1
Combined:       63
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          1
Combined:       32

# ethtool -x ens2f0
RX flow hash indirection table for ens2f0 with 32 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:      0     1     2     3     4     5     6     7
   24:      8     9    10    11    12    13    14    15
   32:      0     1     2     3     4     5     6     7
   40:      8     9    10    11    12    13    14    15
   48:      0     1     2     3     4     5     6     7
   56:      8     9    10    11    12    13    14    15
   64:      0     1     2     3     4     5     6     7
   72:      8     9    10    11    12    13    14    15
   80:      0     1     2     3     4     5     6     7
   88:      8     9    10    11    12    13    14    15
   96:      0     1     2     3     4     5     6     7
  104:      8     9    10    11    12    13    14    15
  112:      0     1     2     3     4     5     6     7
  120:      8     9    10    11    12    13    14    15
RSS hash key:
d7:81:b1:8c:68:05:a9:eb:f4:24:86:f6:28:14:7e:f5:49:4e:29:ce:c7:2e:47:a0:08:f1:e9:31:b3:e5:45:a6:c1:30:52:37:e9:98:2d:c1
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

# uname -a
Linux test-2 5.8.0-44-generic #50-Ubuntu SMP Tue Feb 9 06:29:41 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Edit 2: I've also tried MoonGen as a packet generator now and flooded Machine 2 with 10Mpps across 100 different packet variations (flows). Now the traffic is distributed much better between the cores when dropping all these packets, with minimal CPU load. But XDP_TX still can't keep up and loads a single core to 100% while processing ~3Mpps.

Marcus Wichelmann
  • What throughput do you get with `xdp_redirect_map`? Are you passing `-S` to the bcc script by any chance? – pchaigno Mar 18 '21 at 21:37
  • That a single core is used for `XDP_TX` seems a bit strange. It might be worth checking what's happening there (queue config. on the NIC, IRQ affinities). – pchaigno Mar 18 '21 at 21:38
  • Thank you for your comment. Dropping all packets is as fast as redirecting all packets with `xdp_redirect_map`: ~11Mpps. Only XDP_TX is way slower. No, I did not enable SKB mode; in fact, I can even reproduce the issue by loading a minimal XDP program with just one line, `return XDP_TX;`, which still results in ~4Mpps (I can see the bandwidth in bmon). – Marcus Wichelmann Mar 19 '21 at 07:04
  • @pchaigno I've extended the question with more information about the NIC now. If you know more places to look at that might be interesting, please let me know and I'll add them. – Marcus Wichelmann Mar 19 '21 at 07:26

1 Answer


I've now upgraded the kernel of Machine 2 to 5.12.0-rc3 and the issue disappeared. Looks like this was a kernel issue.

If somebody knows more about this or has a changelog regarding this, please let me know.

Marcus Wichelmann