
I noticed that numactl has a strange impact on the STREAM benchmark.

More specifically, "numactl ./stream_c.exe" reports 40% lower memory bandwidth than "./stream_c.exe".

I checked the numactl source code and don't see anything special it should do when it is given no options. So I would naively expect numactl to have no performance impact in "numactl ./stream_c.exe", but my experiment shows otherwise.

This is a two-socket server with high-core-count processors.

Using numastat, I can see that the numactl command causes the memory allocation to be unbalanced: the two NUMA nodes split the allocation roughly 80:20.

Without numactl, the memory is allocated in a much more balanced way: 46:54.

I also found that this is not only a numactl issue. If I use perf to invoke stream_c.exe, the memory allocation is even more unbalanced than with numactl.
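
For reference, the comparison looks roughly like this (a sketch; numastat ships with the numactl package and the timing is arbitrary):

./stream_c.exe &               # direct run: allocation splits roughly 46:54 across the two nodes
sleep 5; numastat -p stream_c; wait

numactl ./stream_c.exe &       # through numactl with no options: roughly 80:20
sleep 5; numastat -p stream_c; wait

perf stat ./stream_c.exe &     # through perf: even more unbalanced
sleep 5; numastat -p stream_c; wait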

So this is really more of a kernel question: how do numactl and perf change the memory placement policy for their sub-processes? Thanks!

yeeha
  • What hardware do you have? You can reproduce this reliably across multiple runs? – Peter Cordes Jun 27 '20 at 00:06
  • This is an AMD EPYC 7742 server. It looks consistent across other EPYC 7742 machines. EPYC 7752 servers have similar issues. Other high-core-count Skylake servers have no or less obvious problems, but I can still observe some. – yeeha Jun 27 '20 at 05:22
  • What is the default policy of numactl (you can find it with `numactl -h`)? Did you modify the stream benchmark code so that it works on in-memory data and not *in-cache data*? What is the resulting throughput with and without numactl? – Jérôme Richard Jun 27 '20 at 08:23
  • @JérômeRichard '-h' doesn't show anything. '-s' shows: "policy: default preferred node: current", which is expected since I didn't change any configuration. Yes, I did make sure the benchmark has a large memory footprint to exercise memory bandwidth. – yeeha Jun 27 '20 at 16:56

1 Answer


TL;DR: The default policy used by numactl can cause performance issues, as can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes.

Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind, or --localalloc. The policy changes how the operating system allocates a memory page when it is first touched. Here is the meaning of each policy:

  • --interleave: memory pages are allocated across the nodes specified by a nodeset, in a round-robin fashion;
  • --preferred: memory is allocated from a single preferred memory node. If sufficient memory is not available there, memory can be allocated from other nodes;
  • --membind: memory is only allocated from the specified nodes. Allocation fails when there is not enough memory available on these nodes;
  • --localalloc: memory is always allocated on the current node (the one performing the first touch of the memory page).

In your case, specifying an --interleave or a --localalloc policy should give better performance. I suspect that the --localalloc policy is the best choice if threads are bound to cores (see below).
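
For example (a sketch, using the same binary as in your command):

numactl --interleave=all ./stream_c.exe   # spread pages round-robin across both nodes
numactl --localalloc ./stream_c.exe       # allocate on the node of the thread doing the first touch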

Moreover, the STREAM_ARRAY_SIZE macro is set to a value that is too small by default (10,000,000 elements) to actually measure the performance of the RAM. Indeed, the AMD EPYC 7742 processor has a 256 MiB L3 cache, which is big enough to hold all the data of the benchmark. It is much better to compare the results with a working set much bigger than the L3 cache (e.g. 1024 MiB).
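
For example, something along these lines (the array size here is just an illustrative value well beyond the L3 cache):

gcc -O2 -march=native -fopenmp -DSTREAM_ARRAY_SIZE=200000000 stream.c -o stream_c.exe
# 200M doubles per array is about 1.5 GiB per array (~4.5 GiB total), far beyond the 256 MiB L3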

Finally, OpenMP threads may migrate from one NUMA node to another. This can drastically decrease the performance of the benchmark since, when a thread moves to another node, the memory pages it accesses are located on a remote node and the target NUMA node can be saturated. You need to bind OpenMP threads so they cannot move, using for example the following environment variables for this specific processor: OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1", assuming SMT is disabled and the core IDs are in the range 0-63. If SMT is enabled, you should tune the OMP_PLACES variable using the command: OMP_PLACES={$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)} (which requires the hwloc package to be installed on the machine). You can check the thread binding with the command hwloc-ps.
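
Putting the pieces together for the SMT-disabled case (a sketch, assuming the 0-63 core numbering mentioned above):

export OMP_NUM_THREADS=64
export OMP_PROC_BIND=TRUE
export OMP_PLACES="{0}:64:1"
numactl --localalloc ./stream_c.exe
hwloc-ps      # verify the binding of the benchmark threads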


Update:

numactl impacts the NUMA attributes of all child processes (since child processes are created using fork on Linux, which copies the NUMA attributes). You can check that with the following (bash) commands:

numactl --show
numactl --physcpubind=2 bash -c "(sleep 1; numactl --show; sleep 2;) & (sleep 2; numactl --show; sleep 1;); wait"

The first command shows the initial NUMA attributes. The second sets them for a child bash process that launches 2 processes in parallel; each sub-child process then shows its NUMA attributes. The result is the following on my 1-node machine:

policy: default                    # Initial NUMA attributes
preferred node: current
physcpubind: 0 1 2 3 4 5 
cpubind: 0 
nodebind: 0 
membind: 0 
policy: default                    # Sub-child 1
preferred node: current
physcpubind: 2 
cpubind: 0 
nodebind: 0 
membind: 0 
policy: default                    # Sub-child 2
preferred node: current
physcpubind: 2 
cpubind: 0 
nodebind: 0 
membind: 0 

Here, we can see that the --physcpubind=2 constraint is applied to both sub-child processes. It should be the same with --membind or the NUMA policy on your multi-node machine. Likewise, note that running the benchmark under perf should not have any impact on the NUMA attributes of the child process.
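
As a quick check on the 2-socket machine (a sketch; node 0 is chosen arbitrarily):

numactl --membind=0 bash -c "numactl --show"   # the child should report 'membind: 0'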

perf should not directly impact the allocations. However, the default policy of the OS may balance the amount of allocated RAM across the NUMA nodes. If this is the case, the page allocations can be unbalanced, decreasing the bandwidth due to the saturation of one NUMA node. The OS memory page balancing between NUMA nodes is very sensitive: writing a huge file between two benchmark runs can impact the second run, for example (and thus the measured performance). This is why NUMA attributes should always be set manually for HPC benchmarks (as well as process binding).
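
You can observe this balance directly with numastat (a sketch):

numastat -m    # per-node MemFree/MemUsed; run before and after the benchmark to see where pages landed
numastat       # numa_hit/numa_miss counters, useful to spot allocations that fell back to another node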


PS: I assumed the STREAM benchmark has been compiled with OpenMP support and that optimizations have been enabled too (i.e. -fopenmp -O2 -march=native on GCC and Clang).

Jérôme Richard
  • Thanks for the tips on hwloc. My guess was that if I invoke stream with numactl or perf, stream probably inherits the memory placement preference from numactl/perf, which sets the policy to prefer the NUMA node they are running on. Stream then inherits this preference, causing the memory placement imbalance. Is that possible? If the answer is yes, it is not a desirable behavior, since what you observe with the perf tool might be far from what happens without perf. – yeeha Jun 27 '20 at 21:22
  • @yeeha I added more information related to your question. Do they answer your questions? – Jérôme Richard Jun 28 '20 at 10:12
  • Zen has 8MiB of L3 per cluster of 4 cores. Unlike Intel, multiple clusters do *not* form a large shared L3; the most cache a single core can benefit from directly (with normal L3 hits) is approximately 8MiB. IDK if cache->cache transfers can let one cluster benefit from lines being hot in another cluster; maybe for reads, but it's unlikely that a large memset could push its lines into the L3 of another cluster. Agreed that much larger than 8M would be better to simplify what you're measuring, though. But it would be interesting to see if 10MiB has any different effects for a single-threaded test. – Peter Cordes Jun 28 '20 at 11:05
  • @JérômeRichard Thanks! I still cannot connect the dots. I used "-fopenmp -O2 -DSTREAM_ARRAY_SIZE=80000000" when I built stream. Each array has 80M elements (640MB in size), so the 3 arrays occupy around 1.92GB. My DRAM per socket is much larger than 2GB. I still don't see why the 2 different invocation paths (direct invocation of stream, or invoking stream via numactl/perf) could give stream different memory placement policies. "the default policy of the OS could balance the amount of allocated RAM on each NUMA node." Can you give some examples of what could happen? – yeeha Jun 29 '20 at 07:36