
There are 4 CPU cores and one Ethernet card on my Raspberry Pi.
I need interrupts from the NIC to be routed to all four CPU cores.
I set /proc/irq/24/smp_affinity to 0xF (binary 1111), but that doesn't help.
In the sixth column of /proc/interrupts I don't see IO-APIC (which definitely supports* affinity routing) but GICv2 instead. I still can't find any useful info about GICv2 and smp_affinity.
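
For reference, this is roughly what I am doing (IRQ 24 is what my NIC shows up as on my board; the grep pattern depends on how the driver names its IRQ):

    # Find the NIC's IRQ number and the interrupt controller name (sixth column)
    grep eth /proc/interrupts

    # Current affinity mask for that IRQ (one bit per CPU)
    cat /proc/irq/24/smp_affinity

    # Try to allow all four CPUs: 0xF = binary 1111
    echo f > /proc/irq/24/smp_affinity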

Does GICv2 support SMP affinity routing?

*UPD, from that post:

> The only reason to look at this value is that SMP affinity will only work for IO-APIC enabled device drivers.

NK-cell

2 Answers


TL;DR - The existence of /proc/irq/24/smp_affinity indicates that your Linux SMP system supports affinity. The text IO-APIC is the type of interrupt controller (typical PC) and it does NOT indicate that the system can handle affinities. On ARM systems a GIC is usually the interrupt controller, although some interrupts can be routed to a 'sub-controller'.


At least the mainline kernel supports some affinity control, as per Kconfig. However, I am not sure what you are trying to achieve. The interrupt can only run on one CPU, as only one CPU can take the data off the NIC. If a particular CPU is running the network code and the rest are used for other purposes, affinity makes sense.

The data will probably not be in that core's cache anyway, as the NIC buffers are DMA memory and typically not cacheable. So I am not really sure what you would achieve, or how you expect the interrupts to run on all four CPUs. If you have four NIC interfaces, you can peg each one to a CPU; that may be good for power management.

Specifically, for your case of four CPUs, an affinity mask of 0xf imposes no restriction at all; it is the default. You can cat /proc/irq/24/smp_affinity to see how the affinity is set. Also, the existence of this file indicates that your Linux SMP system supports affinity. The text IO-APIC is just the type of interrupt controller (typical of a PC); it does NOT indicate whether the system can handle affinities.
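
For example (a quick check using the IRQ number from the question; the exact default mask depends on your kernel and CPU count):

    cat /proc/irq/24/smp_affinity        # e.g. f  -> all four CPUs allowed (the default)
    cat /proc/irq/24/smp_affinity_list   # the same information as a CPU list, e.g. 0-3
    grep ' 24:' /proc/interrupts         # the per-CPU counters show where it actually fires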

NOTE: This part is speculative and is NOT how any card I know of works.

The major part of what you want is not generally possible. The NIC registers are a single resource. There are multiple registers, and they have defined sequences of reads and writes to perform an operation. If two CPUs were writing (or even reading) the registers at the same time, it would severely confuse the NIC. Often the CPU is not that involved in the interrupt anyway; frequently only a DMA engine needs to be told about the next buffer.

In order for what you want to be useful, you would need a NIC with several register 'banks' that can be used independently. For instance, separate READ/WRITE packet banks are easy to comprehend. However, there might be several banks for writing different packets, and then the card would have to manage how to serialize them. The card could also do some packet inspection and interrupt different CPUs based on fixed packet values, i.e. a port and an IP address. This packet matching would generate different interrupt sources, and different CPUs could handle different matches.

This would allow you to route different socket traffic to a particular CPU using a single NIC.
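
If such a NIC existed and exposed, say, one interrupt source per bank or queue, the routing itself would just be ordinary per-IRQ affinity again (the IRQ numbers below are made up purely for illustration):

    # Hypothetical NIC with four per-queue interrupt sources, one pinned to each CPU
    echo 1 > /proc/irq/40/smp_affinity   # queue 0 -> CPU0
    echo 2 > /proc/irq/41/smp_affinity   # queue 1 -> CPU1
    echo 4 > /proc/irq/42/smp_affinity   # queue 2 -> CPU2
    echo 8 > /proc/irq/43/smp_affinity   # queue 3 -> CPU3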

The problem is that building such a card in hardware would be incredibly complex compared to existing cards. It would be more expensive, and it would take more power to operate.

With standard NIC hardware, there is no gain in rotating CPUs if the original CPU is not busy. If there is non-network activity, it is better to leave the other CPUs alone so their caches can be used for a different workload (code/data). So in most cases it is best to keep the interrupt on a fixed CPU, unless that CPU is busy, in which case it may ping-pong between a few CPUs. It would almost never be beneficial to run the interrupt on all CPUs.

artless noise
  • *"only one CPU can take the data off the NIC"* - is this really true? I heard about ***routing*** of interrupts. I supposed that interrupts from one NIC will be separated (routed) to several CPUs. Is it possible? P.S. a bit updated question – NK-cell Jan 28 '20 at 08:40
  • I mean something like balancing (uniform distribution) – NK-cell Jan 28 '20 at 09:27
  • Info from the 2.4 kernel is obsolete. Only one CPU can read the NIC registers at a time. You are correct that multiple events can run on different CPUs. It is often good to run on the same CPU for the code cache. – artless noise Jan 28 '20 at 12:41
  • *"multiple events can run on different cpus"* - but this has nothing to do with SMP IRQ affinity? Which adjustment should I make? Please give me a hint on which direction to google. – NK-cell Jan 28 '20 at 13:53
  • I don't think that is a design goal. You have a global workload and this local load. As only one CPU can handle the data or packet at a time, you won't accelerate things and you will slow down non-network activity. – artless noise Jan 28 '20 at 14:04
  • This is a great answer! I just verified that on the Raspberry Pi 4, you can indeed set the interrupt affinity mask for example for the network interrupts, by using e.g. `echo e > /proc/irq/38/smp_affinity` and `echo e > /proc/irq/39/smp_affinity`. Hex `e` corresponds to 1110b, i.e. run on any CPU except for CPU0. When looking at `/proc/interrupts`, I can see the counters on CPU1 increase when network traffic comes in. – Michael Jun 26 '20 at 07:45
  • Note that only CPU1 is getting interrupts, even though CPU2 and CPU3 are in the affinity set. It isn't distributing the interrupts, e.g. the first IRQ 38 to CPU1, then the next IRQ 38 to CPU2, and so on in round-robin fashion. But as I tried to explain in my answer, this isn't supported because it's not a good idea. – TrentP Nov 04 '20 at 08:34

I do not believe the GICv2 supports IRQ balancing. Interrupts will always be handled by the same CPU. At least that was the case when I last looked into this, around the 5.1 kernel. The discussion at the time was that this would not be supported because it was not a good idea.

You will see that interrupts are always handled by CPU 0. Use something like ftrace or LTTng to observe which CPU is doing what.
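
For example, with ftrace's irq tracepoints (this assumes tracefs is mounted at /sys/kernel/tracing and that the NIC's handler is named eth0; adjust for your system):

    # The [00x] field in each trace line is the CPU that ran the hard IRQ handler
    echo 1 > /sys/kernel/tracing/events/irq/irq_handler_entry/enable
    cat /sys/kernel/tracing/trace_pipe | grep eth0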

I think that via the affinity setting you can prevent the interrupt from running on a CPU by setting that CPU's bit to zero. But this does not balance the IRQ over all the CPUs on which it is allowed; it will still always go to the same CPU. You can, however, make that CPU 1 instead of 0.
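
For example, to move the NIC interrupt from the question off CPU 0 and onto CPU 1 only (a sketch; IRQ 24 is taken from the question):

    echo 2 > /proc/irq/24/smp_affinity   # mask 0010 = CPU1 only
    grep ' 24:' /proc/interrupts         # only the CPU1 counter should keep increasing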

So what you can do is put certain interrupts on different CPUs. This would allow, for example, SDIO and the network to not vie for CPU time on CPU 0 in their interrupt handlers. It's also possible to set the affinity of a userspace process so that it will not run on the CPU that handles interrupts, thereby reducing how often the userspace process gets interrupted.
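
A sketch of that kind of manual partitioning (the IRQ numbers and the PID are hypothetical):

    # Keep the SDIO and network interrupts on different CPUs
    echo 2 > /proc/irq/56/smp_affinity   # hypothetical SDIO IRQ -> CPU1
    echo 4 > /proc/irq/24/smp_affinity   # NIC IRQ -> CPU2
    # Keep a latency-sensitive process off the IRQ CPUs
    taskset -cp 0,3 1234                 # pin PID 1234 to CPU0 and CPU3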

So why don't we do IRQ balancing? It ends up not being useful.

Keep in mind that the interrupt handler here is only the "hard" IRQ handler. This usually does not do very much work. It acknowledges the interrupt with the hardware and then triggers a back-end handler, such as a workqueue, an IRQ thread, a softirq, or a tasklet. Those don't run in IRQ context, and they can and will be scheduled onto a different CPU or CPUs based on the current workload.

So even if the network interrupt is always routed to the same CPU, the network stack is multi-threaded and runs on all CPUs. Its main work is not done in the hard IRQ handler that runs on one CPU. Again, use ftrace or LTTng to see this.
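
One way to see the split without a full tracer is to compare the per-CPU counters of the hard IRQ with those of the NET_RX softirq, where the receive processing happens (which CPUs the softirq counts land on depends on workload and configuration):

    grep ' 24:' /proc/interrupts    # hard IRQ: typically one CPU's column keeps climbing
    grep NET_RX /proc/softirqs      # NET_RX softirq: per-CPU counts of the receive processing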

If the hard IRQ does very little, what is most important is to reduce latency, which is best done by running on the same CPU to improve cache effectiveness. Spreading it out is likely worse for latency and also for the total cost of handling the IRQs.

The hard IRQ handler can only run one instance at a time. So even if it were balanced, it could use only one CPU at any given moment. If that were not the case, the handler would be virtually impossible to write without race conditions. If you want to use multiple CPUs at the same time, don't do the work in the hard IRQ handler; do it in a construct like a workqueue, which is how the network stack works, and the block device layer too.

IRQs aren't balanced, because it's not usually the answer. The answer is to not do the work in IRQ context.

TrentP
  • The original question was not about IRQ balancing. The OP thought that affinity meant IRQ balancing. The last part of my answer also explains why this won't work. If hardware were completely different, it could make sense. Your example of SDIO and NIC is what affinity is for (but that is not balancing an IRQ from a single device). – artless noise Nov 04 '20 at 19:15
  • One could put spin locks around hardware access, so that two CPUs could run the same hard IRQ at once, and not clobber each other on the same registers. Locks like this are normal in kernel code to protect hardware access from hard IRQ handlers vs kernel thread(s) running at the same time. And *if* the kernel processed the entire network stack for a packet in the NIC irq handler, then maybe running multiple IRQ handlers for the NIC at once with locks to protect critical sections would make sense. But the kernel doesn't do that. It runs the multi-threaded network stack outside hard IRQs. – TrentP Nov 05 '20 at 03:53
  • Yes, but as with FIQ mode, IRQ mode, TrustZone, etc., you don't need spin locks. So the idea is that CPU1 and CPU2 would see different banks of registers for the IRQ, which would handle different packet reception from the NIC. The NIC would have to have some vector support based on smart IP/port (or socket) decoding. That was the idea, and this is the only way that the 'balancing' the OP wanted would make some sense. If that happened, the NIC could just present multiple IRQ sources and use affinity to lock them to CPU1/CPU2. Then these IRQs could happen simultaneously. – artless noise Nov 05 '20 at 15:49
  • Global NIC behaviour would need a spinlock for things like MUA/signalling changes. The idea is that the discrete per-packet data could be routed to a CPU that was handling the data for that port. So the user code on that CPU (cache/TLS/TLB, etc.) would already be primed, and there might be no user-space context switching; a file would just show more data available. This would be beneficial, but that is not the way things work... – artless noise Nov 05 '20 at 16:17
  • A real example would be NVMe. My single NVMe ssd has eight hard IRQs. While a given IRQ is always on the same core, the eight IRQs run on seven different cores. (q0 and q7 are on the same core, and q0 has 1000x fewer irqs than q1-7, so maybe that makes sense). So more than one CPU can get blocks out of the hardware at the same time. One assumes each queue has a mostly independent register bank so very little code needs to execute inside a locked critical section. – TrentP Nov 06 '20 at 00:19