I want to run an MPI version of the STREAM benchmark on one node of a cluster to measure the sustainable memory bandwidth for different numbers of MPI processes. Each node consists of two Intel® Xeon® E5-2680 v3 processors (12 cores each).
In the following I present the results in MByte/s for the triad test, using Open MPI v1.8.2 and the option map-by core. The source was compiled using a wrapper for icc (ICC) 15.0.0 20140723 with the compiler options -O3 -xHost. Each process uses double-precision arrays of length 2*10^6, and the tests are repeated 50 times:
01: Triad: 22017.3438
02: Triad: 29757.8394
03: Triad: 30224.1759
04: Triad: 30080.7369
05: Triad: 30209.6233
06: Triad: 30028.2044
07: Triad: 35064.7215
09: Triad: 44961.7710
10: Triad: 49721.1975
11: Triad: 54814.0579
12: Triad: 58962.7279
13: Triad: 64405.3634
14: Triad: 69330.3864
15: Triad: 74137.0623
16: Triad: 78838.8075
17: Triad: 84006.1067
18: Triad: 89012.6674
19: Triad: 94105.8690
20: Triad: 98744.3634
21: Triad: 103948.1538
22: Triad: 108055.3862
23: Triad: 114154.4542
24: Triad: 118730.5429
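For reference, this is a minimal sketch of how a STREAM-style triad rate such as the ones above is usually derived from the array length and the measured time. It assumes the standard STREAM convention of counting three 8-byte words per element (two loads, one store, no write-allocate traffic) and assumes that the reported figure is the aggregate over all processes. The function names triad_bytes and triad_mbytes_per_s are hypothetical and not taken from the PETSc source.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes counted for one triad sweep a[j] = b[j] + s*c[j] over n elements
 * per process: two loads and one store of a double per element.
 * Assumption: the total is simply summed over nprocs identical processes. */
double triad_bytes(size_t n, int nprocs)
{
    assert(nprocs > 0);
    return 3.0 * (double)sizeof(double) * (double)n * (double)nprocs;
}

/* Rate in MByte/s, where 1 MByte = 10^6 bytes as in STREAM. */
double triad_mbytes_per_s(size_t n, int nprocs, double seconds)
{
    assert(seconds > 0.0);
    return 1.0e-6 * triad_bytes(n, nprocs) / seconds;
}
```

With arrays of 2*10^6 doubles, each process touches 48 MByte per sweep under this counting, so a single-process sweep that takes 1 ms would be reported as roughly 48000 MByte/s.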
What puzzles me is the stagnation of the measured sustainable bandwidth for 2-6 processes. I think that the Turbo Boost of the processors might be biasing the results. Apparently Turbo Boost only becomes active when a few cores are in use, but I am not sure how to interpret the results correctly.
To "turn off" the Turbo Boost one possibility could be to modify the STREAM benchmark as follows:
- Always use all 24 available cores to maintain the processor load approximately constant throughout the benchmark
- Split the MPI communicator MPI_COMM_WORLD of the benchmark into two separate ones. One communicator is associated with the group of N processes that actually run the benchmark; the other contains the remaining 24-N processes, whose only purpose is to keep the cores busy and prevent Turbo Boost
- The 24-N processes only use data from the L1 cache to minimize effects on the other processes
- Extend the time the benchmark runs to several minutes (Turbo Boost effects do not last longer than a few seconds?)
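The communicator split described in the list could be sketched roughly as follows. MPI_Comm_split is the standard MPI call for this; everything else (the names role_of_rank, split_by_role, ROLE_BENCH, ROLE_FILLER and the WITH_MPI build macro) is hypothetical and only used for illustration. The MPI part is guarded so that the rank-to-role mapping can also be compiled and checked without an MPI installation.

```c
#include <assert.h>

#ifdef WITH_MPI            /* hypothetical macro: build with mpicc -DWITH_MPI */
#include <mpi.h>
#endif

enum { ROLE_BENCH = 0, ROLE_FILLER = 1 };   /* hypothetical role names */

/* Ranks 0..n_bench-1 run STREAM; all remaining ranks only keep cores busy. */
int role_of_rank(int rank, int n_bench, int world_size)
{
    assert(n_bench >= 1 && n_bench <= world_size);
    assert(rank >= 0 && rank < world_size);
    return rank < n_bench ? ROLE_BENCH : ROLE_FILLER;
}

#ifdef WITH_MPI
/* Split MPI_COMM_WORLD into a benchmark and a filler communicator.
 * Returns the sub-communicator this rank belongs to and stores its role. */
MPI_Comm split_by_role(int n_bench, int *role_out)
{
    int rank, size;
    MPI_Comm sub;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    *role_out = role_of_rank(rank, n_bench, size);

    /* Same color => same sub-communicator; key = rank keeps the rank order. */
    MPI_Comm_split(MPI_COMM_WORLD, *role_out, rank, &sub);
    return sub;
}
#endif
```

With such a split, barriers and reductions inside the benchmark would have to run on the returned communicator instead of MPI_COMM_WORLD, because the filler ranks do not take part in them.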
A first implementation uses only the triad kernel for the remaining 24-N processes, on a data set that should fit into the L1 cache (32 kByte): 3 arrays of 1000 doubles each, i.e. 24 kByte. The number of repetitions of the benchmark has been raised to 5000 and the array length to 1*10^7. The function foo is called to prevent (successfully?) loop unrolling; the preceding if condition is false throughout the entire program execution. The relevant code reads:
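if (rank < N) {
    ...
    UNMODIFIED BENCHMARK
    ...
} else {
    /* Filler ranks: repeat the triad on the first 1000 elements only. */
    scalar = 3.0;
    /* i must be a 64-bit integer type: 10^10 does not fit into a 32-bit int. */
    for (i = 0; i < 10000000000; i++) {
        for (j = 0; j < 1000; j++) a[j] = b[j] + scalar*c[j];
        /* The condition is false at run time, so foo is never actually called. */
        if (a[2] < 1.0) foo(a, b, c);
    }
}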
a, b and c are declared as static double a[LEN],b[LEN],c[LEN] outside of main. I also read this discussion about the static modifier and its effects on the benchmark. I am not sure if calling foo is really necessary.
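One possible alternative to the conditional foo call is to route one value per sweep through a volatile object, so that the filler loop has an observable side effect. The following is only a sketch under that assumption; the names FILL_LEN, filler_triad, fill_scalar and fill_sink are hypothetical, and whether a particular compiler (icc with -O3 -xHost in my case) really keeps the loop in the intended form would still have to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

#define FILL_LEN 1000   /* 3 arrays * 1000 doubles * 8 B = 24 kByte */

static double fa[FILL_LEN], fb[FILL_LEN], fc[FILL_LEN];

/* Volatile objects: every read and write of them must really be performed. */
static volatile double fill_scalar = 3.0;
static volatile double fill_sink;

/* Run `sweeps` triad passes over the small arrays and return the value
 * that was last written to the volatile sink. */
double filler_triad(long long sweeps)
{
    assert(sweeps > 0);

    for (size_t j = 0; j < FILL_LEN; j++) {
        fb[j] = 1.0;
        fc[j] = 2.0;
    }

    for (long long i = 0; i < sweeps; i++) {
        /* Re-read the scalar every sweep so the compiler cannot treat the
         * inner loop as loop-invariant and hoist it out of the outer loop. */
        const double s = fill_scalar;
        for (size_t j = 0; j < FILL_LEN; j++)
            fa[j] = fb[j] + s * fc[j];

        /* Publish one result per sweep through the volatile sink. */
        fill_sink = fa[(size_t)(i % FILL_LEN)];
    }
    return fill_sink;
}
```

The loop counter is a long long here so that iteration counts such as 10^10 can be represented.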
Unfortunately my approach didn't lead to different results from the benchmark.
What is wrong with my approach? Do you know about alternatives or do you have suggestions on how I should modify my benchmark?
EDIT:
Since one user showed interest in the code I used for benchmarking: I used a modified version of the STREAM benchmark that ships with PETSc.
Once you install PETSc on your system, you will find the benchmarks in src/benchmarks/streams.
If you are only interested in the source code you will find the most recent version on their GitHub page:
https://github.com/petsc/petsc/tree/master/src/benchmarks/streams