Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

Question

For memory-bound programs it is not always faster to use many threads, say as many as there are cores, since threads may compete for memory channels. On a two-socket machine, fewer threads are usually better, but we need to set an affinity policy that distributes the threads across sockets to maximize the memory bandwidth.

Intel OpenMP claims that KMP_AFFINITY=scatter achieves this purpose, while the opposite value "compact" places threads as close together as possible. I have used ICC to build the Stream program for benchmarking, and this claim is easily validated on Intel machines. And if KMP_AFFINITY is set, the native OpenMP env vars like OMP_PLACES and OMP_PROC_BIND are ignored. You will get a warning like this:

        OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

However, a benchmark on the newest AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the slowest memory bandwidth possible. It seems that this setting does exactly the opposite on AMD machines: it places threads as close together as possible, so that even the L3 cache at each NUMA node is not fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP, as the warning above says.

The AMD machine has two sockets with 64 physical cores per socket. I have tested using 128, 64, and 32 threads, and I want them to be spread across the whole system. Using OMP_PROC_BIND=spread, Stream gives me a triad speed of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even when OMP_PROC_BIND=spread is still present, Stream gives 264, 144, and 72 GB/s.

Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives better performance. This further suggests that all the threads are in fact placed as close together as possible and are not scattered at all.
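The scaling argument can be checked with a few lines of arithmetic. The sketch below uses only the triad figures quoted above and divides each by its thread count; the variable and function names are my own.

```python
# Triad bandwidth in GB/s from the measurements quoted above,
# keyed by thread count.
spread = {128: 225, 64: 290, 32: 300}   # OMP_PROC_BIND=spread
scatter = {128: 264, 64: 144, 32: 72}   # KMP_AFFINITY=scatter


def per_thread(results):
    """Return bandwidth per thread (GB/s) for each thread count."""
    return {n: bw / n for n, bw in results.items()}


for name, results in (("spread", spread), ("scatter", scatter)):
    for n, bw in sorted(per_thread(results).items(), reverse=True):
        print(f"{name:8s} {n:4d} threads: {bw:6.2f} GB/s per thread")
```

With scatter, the per-thread figure stays at roughly 2.1 to 2.25 GB/s, so total bandwidth simply halves whenever the thread count halves. That is what one would expect if the remaining threads were packed onto neighbouring cores. With spread, the per-thread figure rises as threads are removed.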

In summary, KMP_AFFINITY=scatter displays completely opposite (in the bad way) behavior on AMD machines, and it even overrides the native OpenMP environment variables regardless of the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch the slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?

Is this a known issue? Or can someone validate my findings?

To give more context, I am a developer of a commercial computational fluid dynamics program. Unfortunately we link our program with the ICC OpenMP library, and KMP_AFFINITY=scatter is set by default because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with KMP_AFFINITY=scatter set, our program becomes 4X slower (when using 32 threads) than the actual speed the program can achieve on the AMD machine.

Update:

Now using hwloc-ps I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD Threadripper 3 machine. I have attached the lstopo result. I ran my CFD program (built by ICC2017) with 16 threads. OMP_PROC_BIND=spread can place one thread in each CCX so that the L3 cache is fully utilized. hwloc-ps -l -t gives:

With KMP_AFFINITY=scatter set, I got:

I will try the latest ICC/Clang OpenMP runtime and see how it works.

For future reference, use `KMP_AFFINITY=verbose,scatter` to have the affinity map printed and compare the logical CPUs in the map with the output of `hwloc`. — Hristo Iliev, Oct 20 '20 at 09:16
@HristoIliev Thanks! I just tried KMP_AFFINITY=verbose,scatter,granularity=core on two one-socket machines: an AWS c5d.12xlarge with a 24-core Xeon and a 64-core Threadripper 3. Now I can confirm that the "scatter" algorithm is not designed properly. It tries to place threads across sockets, but within each socket, threads are placed as close together as possible. So on a one-socket machine, threads are simply placed consecutively starting from core:0. — AeroD, Oct 20 '20 at 14:55
It has no impact on the Intel chip, since all cores share the L3 cache. But on the AMD chip, every 4 cores on a CCX have their own L3 cache, so this placement leads to minimum usage of the L3 cache on that chip. I will follow up with my study on the two-socket AMD EPYC machine later. — AeroD, Oct 20 '20 at 14:55

score 4 · Accepted Answer · edited Oct 20 '20 at 03:49

TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (it cannot be used with KMP_AFFINITY at the same time). You can mix it with OMP_PLACES to bind threads to cores manually. Moreover, numactl should be used to control the memory channel binding or more generally NUMA effects.

Long answer:

Thread binding: OMP_PLACES can be used to bind each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are mutually exclusive options: they should not be used together (OMP_PROC_BIND is a newer, portable replacement for the older KMP_AFFINITY environment variable). As the core topology changes from one machine to another, you can use the hwloc tools to get the list of PU ids required by OMP_PLACES: specifically, hwloc-calc to get the list and hwloc-ls to check the CPU topology. All threads should be bound separately so that no migration is possible. You can check the binding of the threads with hwloc-ps.
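As a rough illustration of building such a place list, here is a minimal Python sketch that emits one single-PU place per group of cores (for example, one per CCX). The helper name `one_place_per_group` is hypothetical. The code also assumes that PUs are numbered consecutively within each group, which is not guaranteed (SMT and firmware settings change the numbering), so check the result against hwloc-ls first.

```python
def one_place_per_group(num_pus, group_size, num_threads):
    """Build an OMP_PLACES string with one single-PU place per group.

    Assumption (verify with hwloc-ls): PUs are numbered consecutively,
    so PUs k*group_size .. (k+1)*group_size-1 all belong to group k.
    """
    num_groups = num_pus // group_size
    if num_threads < 1 or num_threads > num_groups:
        raise ValueError("need between 1 and num_groups threads")
    # Pick every stride-th group so the threads cover the whole machine.
    stride = num_groups // num_threads
    pus = [k * stride * group_size for k in range(num_threads)]
    return ",".join("{%d}" % pu for pu in pus)


# Example: 64 PUs, 4 per group, 16 threads -> first PU of every group.
print(one_place_per_group(num_pus=64, group_size=4, num_threads=16))
```

The resulting string can be exported as OMP_PLACES together with OMP_PROC_BIND=TRUE, and the actual placement then confirmed with hwloc-ps.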

NUMA effects: AMD processors are built by assembling multiple CCXs connected together by a high-bandwidth interconnect (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control/mitigate NUMA effects: processes can be bound to memory channels using the --membind option, and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on their local memory channels. If you want to test a configuration on a given CCX, you can play with --physcpubind and --cpunodebind.
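To see which CPU ids belong to which NUMA node before choosing values for --cpunodebind or --physcpubind, the node-to-CPU map can be read from Linux sysfs. The sketch below is Linux-only and assumes the usual `/sys/devices/system/node/node*/cpulist` files are present; the function names are my own.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids; empty dict if sysfs is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = parse_cpulist(f.read())
    return nodes


for node_id, cpus in sorted(numa_nodes().items()):
    print(f"node {node_id}: {cpus}")
```

The ids printed here are the operating system's CPU numbers, which is worth keeping in mind when comparing them with hwloc's logical indexes.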

My guess is that the Intel/Clang runtime does not perform a good thread binding when KMP_AFFINITY=scatter is set because of a bad PU mapping (which could come from an OS bug, a runtime bug, or bad user/admin settings). This is probably related to the CCX layout (mainstream processors containing multiple NUMA nodes used to be quite rare).
On AMD processors, threads accessing memory of another CCX usually pay a significant additional cost, because the data moves through the (quite slow) Infinity Fabric interconnect and possibly because the interconnect and the memory channels saturate. I advise you not to trust the OpenMP runtime's automatic thread binding: set OMP_PROC_BIND=TRUE, perform the thread/memory bindings manually, and then report bugs if needed.

Here is an example of a resulting command line to run your application: OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" numactl --localalloc ./app

PS: be careful about PU/core IDs and logical/physical IDs.

Thanks a lot! Very informative. I will follow your suggestions and do a thorough study. I also received an HPC tuning guide for the AMD EPYC machine, need to read through it as well. Really appreciate your answer. — AeroD, Oct 18 '20 at 15:42
Note that the Intel/LLVM OpenMP runtime code is available in the LLVM distribution, and that it uses hwloc to understand the machine hardware layout. It therefore ought to be able to get it right :-) — Jim Cownie, Oct 19 '20 at 08:35
Thanks again! I will keep playing with the NUMA effect next. The multithreading in our CFD program follows the original MPI parallelization paradigm, i.e. partition the CFD mesh into many parts and let each thread work on one part. So the communication among those threads is minimized through the mesh partitioning algorithm (metis, chaco, ...). I assume a thread will naturally allocate the memory of the mesh part it works on as locally as possible, so the NUMA effect might not be severe. — AeroD, Oct 19 '20 at 22:56
It is no longer that easy to implement NUMA-aware code that takes advantage of the first-touch NUMA policy, as the automatic NUMA balancer in Linux 3.x and newer, if enabled (which is the case on RHEL and derivatives), constantly moves memory pages between domains based on a couple of heuristics. One must use special thread-aware memory allocators or the functions in `libnuma` (or disable the balancer altogether). — Hristo Iliev, Oct 20 '20 at 09:07
