For memory-bound programs it is not always faster to use many threads, say the same number as the cores, since threads may compete for memory channels. Usually on a two-socket machine, less threads are better but we need to set affinity policy that distributes the threads across sockets to maximize the memory bandwidth.
Intel OpenMP claims that KMP_AFFINITY=scatter is to achieve this purpose, the opposite value "compact" is to place threads as close as possible. I have used ICC to build the Stream program for benchmarking and this claim is easily validated on Intel machines. And if OMP_PROC_BIND is set, the native OpenMP env vars like OMP_PLACES and OMP_PROC_BIND are ignored. You will get such a warning:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
However, a benchmark on a newest AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the slowest memory bandwidth possible. It seems that this setting is doing exactly the opposite on AMD machines: placing threads as close as possible so that even the L3 cache at each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP as the warning above says.
The AMD machine has two sockets, 64 physical cores per socket. I have tested using 128, 64, and 32 threads and I want them to be spread across the whole system. Using OMP_PROC_BIND=spread, Stream gives me a triad speed of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even when OMP_PROC_BIND=spread is still present, Streams gives 264, 144, and 72 GB/s.
Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives better performance, this even further suggests that in fact all the threads are placed as close as possible, but not scattering at all.
In summary, KMP_AFFINITY=scatter displays completely opposite (in the bad way) behavior on AMD machines and it will even overwrite native OpenMP environment regardless the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch the slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?
Is this a known issue to someone? Or someone can validate my findings?
To give more context, I am a developer of commercial computational fluid dynamics program and unfortunately we links our program with ICC OpenMP library and KMP_AFFINITY=scatter is set by default because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with setting KMP_AFFINITY=scatter, our program becomes 4X slower (when using 32 threads) than the actual speed the program can achieve on the AMD machine.
Update:
Now using hwloc-ps I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD threadripper 3 machine. I have attached the lstopo result. I run my CFD program (built by ICC2017) with 16 threads. OPM_PROC_BIND=spread can place one thread in each CCX so that L3 cache is fully utilized. Hwloc-ps -l -t gives:
While setting KMP_AFFINITY=scatter, I got
I will try the latest ICC/Clang OpenMP runtime and see how it works.