
Modern AMD CPUs consist of multiple CCX. Each CCX has a separate L3 cache.

It's possible to set process affinity to limit a process to certain CPU cores.

Is there a way to force Linux to schedule two processes (parent process thread & child process) on two cores that share L3 cache, but still leave the scheduler free to choose which two cores?

Peter Cordes
cmpxchg8b

2 Answers


Newer Linux may do this for you: Cluster-Aware Scheduling Lands In Linux 5.16 - there's support for scheduling decisions to be influenced by the fact that some cores share resources.

If you manually pick a CCX, you could give both processes the same affinity mask that allows them to be scheduled on any of the cores in that CCX.

An affinity mask can have multiple bits set.
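For the manual route, a minimal sketch (assuming, purely for illustration, that logical CPUs 0-3 belong to one CCX; the real grouping is reported in /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list):

```c
/* Sketch: pin the calling process (and any children it forks later) to the
 * cores of one CCX. Cores 0-3 are an assumption for illustration only;
 * check /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list for the
 * real grouping on your machine. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; cpu++)   /* all cores of the assumed CCX */
        CPU_SET(cpu, &mask);

    /* pid 0 = the calling process; children forked afterwards inherit the mask */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}
```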


I don't know of a way to let the kernel decide which CCX, but then schedule both tasks to cores within it. If the parent checks which core it's currently running on, it could set a mask to include all cores in the CCX containing it, assuming you have a way to detect how core #s are grouped, and a function to apply that.
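A sketch of that idea, under the (hypothetical) assumptions that cores are numbered contiguously within a CCX and that there are 4 cores per CCX; real code should read the grouping from sysfs rather than hard-coding it:

```c
/* Sketch: find the core we're on, then widen the affinity mask to the whole
 * CCX containing it. CORES_PER_CCX = 4 and contiguous numbering within a CCX
 * are assumptions; real code should read the grouping from sysfs. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define CORES_PER_CCX 4   /* assumed; depends on the CPU model and SMT numbering */

int main(void)
{
    int cpu = sched_getcpu();                 /* core we are currently running on */
    if (cpu < 0) { perror("sched_getcpu"); return 1; }

    int first = (cpu / CORES_PER_CCX) * CORES_PER_CCX;  /* first core of this CCX */

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = first; c < first + CORES_PER_CCX; c++)
        CPU_SET(c, &mask);

    /* Apply to ourselves before fork() so the child inherits it, or call
     * sched_setaffinity(child_pid, ...) on the child afterwards. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("on cpu %d, now restricted to cpus %d-%d\n",
           cpu, first, first + CORES_PER_CCX - 1);
    return 0;
}
```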

You'd want to be careful that you don't end up leaving some CCXs totally unused if you start multiple processes that each do this, though. Maybe every second, do whatever top or htop do to check per-core utilization, and rebalance if needed? (i.e. change the affinity mask of both processes to the cores of a different CCX). Or maybe put this functionality outside the processes being scheduled, so there's one "master control program" that looks at (and possibly modifies) affinity masks for a set of tasks that it should control. (Not all tasks on the system; that would be a waste of work.)

Or if it's looking at everything, it doesn't need to do so much checking of current load average, just count what's scheduled where. (And assume that tasks it doesn't know about can pick any free cores on any CCX, like daemons or the occasional compile job. Or at least compete fairly if all cores are busy with jobs it's managing.)
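For the "check per-core utilization" part, this is roughly the kind of sampling top does: read /proc/stat twice and compare idle vs. total jiffies per core. A sketch, with the field layout as documented in proc(5):

```c
/* Rough per-core busy-fraction sampler from /proc/stat, the kind of check a
 * rebalancing "master control program" could poll once a second.
 * busy = 1 - delta(idle+iowait) / delta(total). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

static int snapshot(long long total[], long long idle[])
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f) { perror("/proc/stat"); return -1; }
    char line[512];
    int n = 0;
    while (fgets(line, sizeof line, f)) {
        /* skip the aggregate "cpu " line and non-cpu lines */
        if (strncmp(line, "cpu", 3) != 0 || line[3] < '0' || line[3] > '9')
            continue;
        int cpu;
        long long usr, nic, syst, idl, iow, irq, sirq, stl;
        if (sscanf(line, "cpu%d %lld %lld %lld %lld %lld %lld %lld %lld",
                   &cpu, &usr, &nic, &syst, &idl, &iow, &irq, &sirq,
                   &stl) == 9 && cpu >= 0 && cpu < MAX_CPUS) {
            total[cpu] = usr + nic + syst + idl + iow + irq + sirq + stl;
            idle[cpu]  = idl + iow;
            if (cpu + 1 > n) n = cpu + 1;
        }
    }
    fclose(f);
    return n;
}

int main(void)
{
    static long long t0[MAX_CPUS], i0[MAX_CPUS], t1[MAX_CPUS], i1[MAX_CPUS];
    if (snapshot(t0, i0) <= 0) return 1;
    sleep(1);
    int n = snapshot(t1, i1);
    for (int c = 0; c < n; c++) {
        double dt = (double)(t1[c] - t0[c]);
        double busy = dt > 0 ? 1.0 - (double)(i1[c] - i0[c]) / dt : 0.0;
        printf("cpu%-3d %3.0f%% busy\n", c, busy * 100.0);
    }
    return 0;
}
```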


Obviously this is not helpful for most parent/child processes, only ones that do a lot of communication via shared memory (or maybe pipes, since kernel pipe buffers are effectively shared memory).

It is true that Zen CPUs have varying inter-core latency within / across CCXs, as well as just cache hit effects from sharing L3. https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 did some microbenchmarking on Zen 3 vs. 2-socket Xeon Platinum vs. 2-socket ARM Ampere.

Peter Cordes

The underlying library functions for processes support setting CPU set masks, which allows you to define a set of cores on which a process is eligible to run. There's the equivalent for pthreads. See this man page and this command line tool.
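A minimal sketch of the pthread side, using pthread_setaffinity_np's attribute variant so the thread starts out constrained; the choice of cores 0-3 is just a placeholder, not a real CCX layout:

```c
/* Sketch: the pthread equivalent. A thread attribute carrying a CPU set is
 * passed to pthread_create(), so the thread starts already constrained.
 * Cores 0-3 are a placeholder. Build with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    printf("worker running on cpu %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < 4; c++)          /* placeholder: one CCX's worth of cores */
        CPU_SET(c, &mask);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(mask), &mask);

    pthread_t t;
    if (pthread_create(&t, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```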

This is quite an interesting piece on how Linux treats NUMA systems. It basically tries to keep code and memory together, so it is already predisposed to doing what you want, out of the box. Though I think it might get fooled if the interaction between two processes is via, for example, shared memory that one allocates and the other ends up merely "accessing" (i.e. in starting the second process, the kernel doesn't know it's going to access memory allocated by a separate process that it has actually put on a core a long way away [in NUMA terms]).

I think CPU sets show some promise. At the bottom of that page there are examples of putting a shell into a specific CPU set. That might be a way to ensure that any subsequent processes started from that shell are kept within the same CPU set, without you having to set core affinities for them individually (I think they'll inherit the set from the shell). You'd still be defining the CPU set in terms of which CPUs are in it, but you'd only be doing it once.
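The inheritance idea can be seen with plain affinity masks too: a mask set on the parent before fork() is inherited by the child (and survives exec()), much like processes launched from a shell that was placed into a CPU set. A small sketch (cores 0-3 are arbitrary, chosen only for illustration):

```c
/* Sketch: an affinity mask set on the parent is inherited by fork()ed
 * children (and survives exec()), much like processes launched from a shell
 * that was placed into a CPU set. Cores 0-3 are arbitrary for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < 4; c++)
        CPU_SET(c, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* constrain the parent */
        perror("sched_setaffinity");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        cpu_set_t inherited;
        sched_getaffinity(0, sizeof(inherited), &inherited);
        printf("child inherited a mask allowing %d cpus\n", CPU_COUNT(&inherited));
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```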

bazza
  • For my specific use case I'm seeing a +40% performance improvement when setting affinity to cores on the same CCX. I'm hoping there's a way I could get the Linux kernel to automatically load balance the processes over CCXes while still always keeping the pairs of processes on the same CCX. So essentially, I don't want to pick specific cores, but just tell the CPU: pick whatever CCX you want to run process A on and then you must schedule process B on one of the other 2 cores in the same CCX. – cmpxchg8b Feb 25 '22 at 20:52
  • The whole point of this question is that AMD CPUs with multiple core-complexes are *not* flat for inter-core latency. See https://www.anandtech.com/show/16529/amd-epyc-milan-review/4. @cmpxchg8b's question seems reasonable to me. – Peter Cordes Feb 25 '22 at 23:23
  • @cmpxchg8b 40%?! Well, that is a significant improvement! Makes me wonder what's going on with Linux on AMD CPUs... I was on Intel Nehalem cores when I tried, quite old now. There might be something in this: https://linux.die.net/man/7/cpuset; note the interesting reference to fork(), which keeps the child process in the same CPU set as the parent. Also looks like you can set load balancing options per CPU set. So you could have processes in a CPU set, and specific rules in that set as to how load balancing is done. – bazza Feb 25 '22 at 23:27
  • @PeterCordes yes I realise that now. Any ideas? If fork() keeps child and parent together, that could be a way of having two processes that are kept together within the same CPU set, without having to explicitly pin them (which is what I think cmpxchg8b is asking). – bazza Feb 25 '22 at 23:32
  • @bazza: Intel CPUs from Nehalem to present have a single shared L3 cache across all cores. AMD CPUs don't; like the question says, two cores in different core-complexes of the same multi-core CPU might not share L3 cache (affecting cache hit rates and bandwidth as well as latency). Re: how to deal with it: see my answer for my ideas. – Peter Cordes Feb 25 '22 at 23:32
  • @PeterCordes indeed so, but not when you've got a bunch of Nehalem CPUs in a single system, in which they're joined up by QPI. You end up with multiple L3 caches, one per CPU, not so very different to AMD core complexes. – bazza Feb 25 '22 at 23:35
  • Ah, I wasn't thinking multi-socket, but yeah true for that case. But then you also have local vs. remote DRAM, not just L3 cache, so the OS maybe tries harder because it knows about NUMA memory allocation. Scheduling for CCXs is relatively new, and maybe not always worth the effort; read-mostly workloads can just end up with the same data replicated in both L3 caches and still efficiently get to DRAM on misses. – Peter Cordes Feb 26 '22 at 01:54
  • @bazza I don't suspect anything is "going on" with Linux on AMD CPUs -- the processes spend most of their time communicating via shared memory, which is just a lot faster if the shared memory stays in the same L3 cache. – cmpxchg8b Mar 01 '22 at 11:00
  • @cmpxchg8b, that fits with the reports of non-uniform access times between CCXs, and the link about how Linux prefers to schedule processes on nodes nearest the memory they have allocated. Shared memory is awkward to handle because Linux can't know about the access to shmem in advance. The second process gets started on a different CCX and allocates non-shared memory there, only for it to then access the shmem at a large NUMA distance. So now it's stuck with long memory access times. Setting the affinity means both get started close together, so the mem allocs are all close by too. – bazza Mar 01 '22 at 20:59