It's been a while since I used MPI, so I'm not really answering the "how to write the code" question. I'm focusing more on the benchmark methodology side of things, so hopefully you can design the benchmark to actually measure something useful. Benchmarking is hard: it's easy to get a number, hard to get a meaningful number that measures what you wanted to measure.
Instead of specifying which nodes you get, you could just query which nodes you got. (i.e. detect the case where multiple processes of your MPI job ended up on the same physical host, competing for memory bandwidth.)
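For example (a minimal sketch, not your code: it assumes a C MPI program, and that `MPI_Get_processor_name` returns something host-specific, which it does on common implementations), each rank can report where it landed so rank 0 can spot co-located processes:

```c
// Sketch: every rank reports its host name so rank 0 can see which ranks
// share a physical node (and will therefore compete for memory bandwidth).
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    // Fixed-size buffers keep the gather simple.
    char *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("rank %d ran on %s\n", i, all + (size_t)i * MPI_MAX_PROCESSOR_NAME);
        free(all);
    }

    MPI_Finalize();
    return 0;
}
```

(If your MPI library supports MPI-3, `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` is another way to find out which ranks share a node.)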
You could also randomize how many threads you run on each node to see how bandwidth scales with the number of threads doing a memcpy, memset, or something read-only like a reduction or memcmp.
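Something along these lines is enough to see the scaling (a sketch under my own assumptions: OpenMP for threading, a read-only sum as the kernel, and an arbitrary 1 GiB array; it's not a polished benchmark):

```c
// Measure per-node read bandwidth as a function of thread count.
// Build (assumed flags): gcc -O3 -march=native -ffast-math -fopenmp bw.c -o bw
//   (-ffast-math lets the sum vectorize with multiple accumulators, which matters
//    for the single-thread case; see the "efficient vectorized asm" caveat below.)
// Run with OMP_NUM_THREADS=1,2,4,... and compare the GB/s numbers.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1ull << 27;        // 128 Mi doubles = 1 GiB: far larger than any cache
    double *a = malloc(n * sizeof *a);

    #pragma omp parallel for schedule(static)   // parallel first touch: see the NUMA point below
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];                    // read-only streaming pass
    double t1 = omp_get_wtime();

    printf("threads=%d  read bandwidth ~ %.1f GB/s  (checksum %g)\n",
           omp_get_max_threads(), n * sizeof(double) / 1e9 / (t1 - t0), sum);
    free(a);
    return 0;
}
```

On the MPI side you'd wrap something like this in each rank and let the job script (or the rank itself, via `omp_set_num_threads`) pick a different thread count per node or per run.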
One thread per machine won't come close to saturating memory bandwidth on recent Intel Xeons, except maybe on low-core-count CPUs that are similar to desktop CPUs (and then only if your code compiles to efficient vectorized asm). L3 / memory latency is too high for the limited memory-level parallelism of a single core to saturate throughput. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and "latency-bound platforms" in Enhanced REP MOVSB for memcpy.)
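Roughly speaking (the numbers below are illustrative ballpark figures, not specs of any particular CPU), a single core can only keep a limited number of cache-line requests in flight at once, so its best case is about

```
per-core bandwidth ≈ outstanding misses × line size / latency
                   ≈ 10 lines × 64 B / 80 ns
                   ≈ 8 GB/s
```

which is a small fraction of what a multi-channel memory subsystem can deliver, so it takes several cores' worth of outstanding misses to keep the memory controllers busy.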
It can take 4 to 8 threads running bandwidth-bottlenecked code (like a STREAM benchmark) to saturate the memory bandwidth of a many-core Xeon. More threads than that will get about the same total, unless you test with quite small arrays so the private per-core L2 caches come into play (256 kiB per core on most Intel CPUs, vs. the large shared L3 of ~2 MiB per core; Skylake-AVX512 bumps the private L2 to 1 MiB per core).
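To put rough numbers on that (a hypothetical 16-core node with the pre-Skylake-AVX512 cache sizes above, purely as an illustration):

```
total private L2 ≈ 16 cores × 256 kiB ≈  4 MiB
shared L3        ≈ 16 cores ×  ~2 MiB ≈ 32 MiB
```

so an array of a few hundred MiB per node comfortably defeats both levels and you're really timing DRAM; shrink the working set to a couple hundred kiB per thread and you're timing L2 instead.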
With dual-socket nodes, NUMA is a factor. If your threads end up using memory that all maps to the physical memory controllers on one socket, leaving the other socket's memory controllers idle, you'll only see half the machine's bandwidth. This can also be a good way to test that the OS kernel's NUMA-aware physical memory allocation does a good job for your actual workload (if your bandwidth microbenchmark is anything like your real workloads, that is).
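The classic way this bites is first-touch placement: on Linux, a physical page is allocated on the NUMA node of the thread that first writes it. Here's a sketch of an experiment you could run on one node (OpenMP, the array size, and the plain sum kernel are again my own arbitrary choices):

```c
// Compare NUMA placement: serial first-touch (all pages on one socket's
// memory controllers) vs. parallel first-touch (pages spread over both).
// Build (assumed flags): gcc -O3 -march=native -ffast-math -fopenmp numa.c -o numa
// Run with enough threads to span both sockets (e.g. OMP_NUM_THREADS = all cores).
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

static void time_read(const double *a, size_t n, const char *label) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    double dt = omp_get_wtime() - t0;
    printf("%-22s %.1f GB/s  (checksum %g)\n", label, n * sizeof(double) / 1e9 / dt, sum);
}

int main(void) {
    const size_t n = 1ull << 28;        // 2 GiB of doubles: much bigger than all caches

    // Case 1: one thread writes every page first -> all pages on that thread's socket.
    double *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    time_read(a, n, "serial first-touch:");
    free(a);

    // Case 2: same static schedule for init and for the timed loop ->
    // each thread's pages live on its own socket, both sockets' controllers in play.
    double *b = malloc(n * sizeof *b);
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) b[i] = 1.0;
    time_read(b, n, "parallel first-touch:");
    free(b);
    return 0;
}
```

On a dual-socket node with a sane default NUMA policy you'd expect something like a 2x difference between the two cases; if you don't see it, that's exactly the kind of thing this test is meant to catch.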
Keep in mind that memory bandwidth is a shared resource for all cores on a node, so for repeatable results you're going to want to avoid competing with other loads. Even something with a small memory footprint can use a lot of bandwidth if its working set doesn't fit in the private per-core L2 caches, so don't assume another job won't compete for memory bandwidth just because it only uses a couple hundred MB.