It's been a while since I used MPI, so I'm not really answering the "how to write the code" question. I'm focusing more on the benchmark methodology side of things, so hopefully you can design the benchmark to actually measure something useful. Benchmarking is hard: it's easy to get a number, hard to get a meaningful number that measures what you wanted to measure.
Instead of specifying which nodes you get, you could just query which nodes you got. (i.e. detect the case where multiple processes of your MPI job ended up on the same physical host, competing for memory bandwidth.)
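For example (a minimal sketch, not your code: it assumes a C MPI program, and that `MPI_Get_processor_name` returns something host-specific, which it does on common implementations), each rank can report where it landed so rank 0 can spot co-located processes:

```c
// Sketch: every rank reports its host name so rank 0 can see which ranks
// share a physical node (and will therefore compete for memory bandwidth).
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    // Fixed-size buffers keep the gather simple.
    char *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("rank %d ran on %s\n", i, all + (size_t)i * MPI_MAX_PROCESSOR_NAME);
        free(all);
    }

    MPI_Finalize();
    return 0;
}
```

(If your MPI library supports MPI-3, `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` is another way to find out which ranks share a node.)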
You could also randomize how many threads you run on each node to see how bandwidth scales with the number of threads doing a memcpy, memset, or something read-only like a reduction or memcmp.
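Something along these lines is enough to see the scaling (a sketch under my own assumptions: OpenMP for threading, a read-only sum as the kernel, and an arbitrary 1 GiB array; it's not a polished benchmark):

```c
// Measure per-node read bandwidth as a function of thread count.
// Build (assumed flags): gcc -O3 -march=native -ffast-math -fopenmp bw.c -o bw
//   (-ffast-math lets the sum vectorize with multiple accumulators, which matters
//    for the single-thread case; see the "efficient vectorized asm" caveat below.)
// Run with OMP_NUM_THREADS=1,2,4,... and compare the GB/s numbers.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1ull << 27;        // 128 Mi doubles = 1 GiB: far larger than any cache
    double *a = malloc(n * sizeof *a);

    #pragma omp parallel for schedule(static)   // parallel first touch: see the NUMA point below
    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];                    // read-only streaming pass
    double t1 = omp_get_wtime();

    printf("threads=%d  read bandwidth ~ %.1f GB/s  (checksum %g)\n",
           omp_get_max_threads(), n * sizeof(double) / 1e9 / (t1 - t0), sum);
    free(a);
    return 0;
}
```

On the MPI side you'd wrap something like this in each rank and let the job script (or the rank itself, via `omp_set_num_threads`) pick a different thread count per node or per run.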
One thread per machine won't come close to saturating memory bandwidth on recent Intel Xeons, except maybe on low-core-count CPUs that are similar to desktop CPUs (and then only if your code compiles to efficient vectorized asm). L3 / memory latency is too high for the limited memory-level parallelism of a single core to saturate throughput. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and "latency-bound platforms" in Enhanced REP MOVSB for memcpy.)
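Roughly speaking (the numbers below are illustrative ballpark figures, not specs of any particular CPU), a single core can only keep a limited number of cache-line requests in flight at once, so its best case is about

```
per-core bandwidth ≈ outstanding misses × line size / latency
                   ≈ 10 lines × 64 B / 80 ns
                   ≈ 8 GB/s
```

which is a small fraction of what a multi-channel memory subsystem can deliver, so it takes several cores' worth of outstanding misses to keep the memory controllers busy.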
It can take 4 to 8 threads running bandwidth-bottlenecked code (like a STREAM benchmark) to saturate the memory bandwidth of a many-core Xeon. More threads than that will get about the same total, unless you test with quite small arrays so the private per-core L2 caches come into play (256 kiB per core on most Intel CPUs, vs. the large shared L3 of ~2 MiB per core; Skylake-AVX512 bumps the private L2 to 1 MiB per core).
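To put rough numbers on that (a hypothetical 16-core node with the pre-Skylake-AVX512 cache sizes above, purely as an illustration):

```
total private L2 ≈ 16 cores × 256 kiB ≈  4 MiB
shared L3        ≈ 16 cores ×  ~2 MiB ≈ 32 MiB
```

so an array of a few hundred MiB per node comfortably defeats both levels and you're really timing DRAM; shrink the working set to a couple hundred kiB per thread and you're timing L2 instead.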
With dual-socket nodes, NUMA is a factor. If your threads end up using memory that all maps to the physical memory controllers on one socket, leaving the other socket's memory controllers idle, you'll only see half the machine's bandwidth. This can also be a good way to test that the OS kernel's NUMA-aware physical memory allocation does a good job for your actual workload (if your bandwidth microbenchmark is anything like your real workloads, that is).
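The classic way this bites is first-touch placement: on Linux, a physical page is allocated on the NUMA node of the thread that first writes it. Here's a sketch of an experiment you could run on one node (OpenMP, the array size, and the plain sum kernel are again my own arbitrary choices):

```c
// Compare NUMA placement: serial first-touch (all pages on one socket's
// memory controllers) vs. parallel first-touch (pages spread over both).
// Build (assumed flags): gcc -O3 -march=native -ffast-math -fopenmp numa.c -o numa
// Run with enough threads to span both sockets (e.g. OMP_NUM_THREADS = all cores).
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

static void time_read(const double *a, size_t n, const char *label) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    double dt = omp_get_wtime() - t0;
    printf("%-22s %.1f GB/s  (checksum %g)\n", label, n * sizeof(double) / 1e9 / dt, sum);
}

int main(void) {
    const size_t n = 1ull << 28;        // 2 GiB of doubles: much bigger than all caches

    // Case 1: one thread writes every page first -> all pages on that thread's socket.
    double *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    time_read(a, n, "serial first-touch:");
    free(a);

    // Case 2: same static schedule for init and for the timed loop ->
    // each thread's pages live on its own socket, both sockets' controllers in play.
    double *b = malloc(n * sizeof *b);
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) b[i] = 1.0;
    time_read(b, n, "parallel first-touch:");
    free(b);
    return 0;
}
```

On a dual-socket node with a sane default NUMA policy you'd expect something like a 2x difference between the two cases; if you don't see it, that's exactly the kind of thing this test is meant to catch.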
Keep in mind that memory bandwidth is a shared resource for all cores on a node, so for repeatable results you're going to want to avoid competing with other loads. Even something with a small memory footprint can use a lot of bandwidth if its working set doesn't fit in the private per-core L2 caches, so don't assume another job won't compete for memory bandwidth just because it only uses a couple hundred MB.