
I want to run an MPI version of the STREAM benchmark on one node of a cluster to measure the sustainable bandwidth for different numbers of MPI processes. Each node consists of two Intel® Xeon® E5-2680 v3 processors (12 cores each).

In the following I present the results in MByte/s for the triad test, using Open MPI 1.8.2 and the option --map-by core. The source was compiled using a wrapper for

icc (ICC) 15.0.0 20140723

with the compiler options

-O3 -xHost.

Each core works on double-precision arrays of length 2*10^6, and the tests are repeated 50 times:

01: Triad: 22017.3438
02: Triad: 29757.8394
03: Triad: 30224.1759
04: Triad: 30080.7369
05: Triad: 30209.6233
06: Triad: 30028.2044
07: Triad: 35064.7215
09: Triad: 44961.7710
10: Triad: 49721.1975
11: Triad: 54814.0579
12: Triad: 58962.7279
13: Triad: 64405.3634
14: Triad: 69330.3864
15: Triad: 74137.0623
16: Triad: 78838.8075
17: Triad: 84006.1067
18: Triad: 89012.6674
19: Triad: 94105.8690
20: Triad: 98744.3634
21: Triad: 103948.1538
22: Triad: 108055.3862
23: Triad: 114154.4542
24: Triad: 118730.5429
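
For reference, the MByte/s figure for the triad test is essentially computed as in the sketch below (a simplified per-repetition view with 2*10^6 doubles per process; the PETSc benchmark aggregates timings over ranks and repetitions, so its exact bookkeeping differs):

  #include <mpi.h>

  #define LEN 2000000                        /* 2*10^6 doubles per process */
  static double a[LEN], b[LEN], c[LEN];

  /* one repetition of the triad kernel, returning the aggregate rate in MByte/s */
  double triad_mbps(int nprocs)
  {
      const double scalar = 3.0;
      double t = MPI_Wtime();
      for (int j = 0; j < LEN; j++)
          a[j] = b[j] + scalar*c[j];
      t = MPI_Wtime() - t;
      /* 3 arrays of 8-byte doubles are streamed (read b, read c, write a);
         the aggregate rate scales with the number of processes */
      return 3.0 * sizeof(double) * LEN * nprocs / t / 1.0e6;
  }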

What puzzles me is the stagnation of the measured sustainable bandwidth for 2-6 processes. I think the Turbo Boost of these processors might be biasing the results: apparently Turbo Boost only becomes active when just a few cores are in use, but I am not sure how to interpret the results correctly.

To "turn off" the Turbo Boost one possibility could be to modify the STREAM benchmark as follows:

  • Always use all 24 available cores, so that the processor load stays approximately constant throughout the benchmark
  • Split the MPI communicator MPI_COMM_WORLD of the benchmark into two separate communicators: one is associated with the group of N processes that actually run the benchmark, the other contains the remaining 24-N processes whose only purpose is to keep the cores busy and prevent Turbo Boost (see the sketch after this list)
  • Let the 24-N filler processes use only data from the L1 cache, to minimize effects on the other processes
  • Extend the runtime of the benchmark to several minutes (Turbo Boost effects do not last longer than a few seconds?)
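
A minimal sketch of the proposed communicator split (the names N and bench_comm are illustrative placeholders, not taken from the actual benchmark code):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, N = 4;            /* N = processes that actually run STREAM */
      MPI_Comm bench_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* color 0: benchmark group, color 1: the 24-N filler processes */
      int color = (rank < N) ? 0 : 1;
      MPI_Comm_split(MPI_COMM_WORLD, color, rank, &bench_comm);

      if (color == 0) {
          /* run the unmodified STREAM kernels, reducing timings over bench_comm */
      } else {
          /* spin on an L1-resident triad loop until the benchmark group finishes */
      }

      MPI_Comm_free(&bench_comm);
      MPI_Finalize();
      return 0;
  }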

A first implementation uses only the triad kernel for the remaining 24-N processes, on a data set that should fit into the L1 cache (32 kByte). The number of repetitions of the benchmark has been raised to 5000, and the length of the data arrays has been raised to 1*10^7. The function foo is called to prevent (successfully?) the compiler from optimizing the filler loop away; the preceding if condition is false throughout the entire program execution. The relevant code reads:

  if (rank < N) {
    ...
    UNMODIFIED BENCHMARK
    ...
  } else {
    /* filler processes: spin on an L1-resident triad loop to keep the core busy */
    /* note: 10000000000 exceeds the range of a 32-bit int, so i must be a 64-bit type */
    scalar = 3.0;
    for (i = 0; i < 10000000000; i++) {
      for (j = 0; j < 1000; j++) a[j] = b[j] + scalar*c[j];
      if (a[2] < 1.0) foo(a, b, c);   /* condition is never true; keeps the loop from being removed */
    }
  }

a, b and c are declared as static double a[LEN], b[LEN], c[LEN] outside of main. I have also read this discussion about the static modifier and its effects on the benchmark. I am not sure if calling foo is really necessary.
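
If the only purpose of foo is to keep the compiler from removing the filler loop, an empty inline-assembly statement acting as a compiler barrier could be an alternative (a sketch using gcc/icc-style inline asm; whether icc at -O3 would otherwise remove the loop is an assumption):

  /* compiler barrier: tells the compiler that memory may be read or written here,
     so the stores to a[] in the filler loop cannot be optimized away */
  #define CLOBBER_MEMORY() __asm__ __volatile__("" ::: "memory")

  for (i = 0; i < 10000000000; i++) {
    for (j = 0; j < 1000; j++) a[j] = b[j] + scalar*c[j];
    CLOBBER_MEMORY();   /* replaces the call to foo */
  }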

Unfortunately, my approach did not lead to results different from those of the original benchmark.

What is wrong with my approach? Do you know of alternatives, or do you have suggestions on how I should modify my benchmark?


EDIT:

Since one user showed interest in the code I used for benchmarking:

I used a modified version of the STREAM benchmark coming with PETSc:

http://www.mcs.anl.gov/petsc/

Once you install PETSc on your system, you will find the benchmarks in src/benchmarks/streams.

If you are only interested in the source code you will find the most recent version on their GitHub page:

https://github.com/petsc/petsc/tree/master/src/benchmarks/streams

  • Turbo Boost can normally be turned off in the BIOS (on normal PCs). – Roy Longbottom Feb 17 '15 at 15:36
  • I know, but I don't have this possibility on our cluster. – el_tenedor Feb 17 '15 at 15:41
  • Hi there. Did you come up with any solution? – Jakuje Dec 09 '15 at 19:22
  • @Jakuje: Unfortunately I didn't come up with a solution. – el_tenedor Dec 10 '15 at 13:50
  • el_tenedor, you may *measure* the current cpu clock externally (`perf stat` / `perf stat -e cpu-clock,cycles` will do this) or internally (read wall time and read the cycles performance counter). Internal reading of cpu cycles (to compute the mean frequency over a program fragment) may be done with some PMU API like perfmon, libpfm3/libpfm4, or PAPI. – osgx Jun 26 '16 at 01:44
  • Why do you even want to defeat Turbo? (**update, oh NVM, I see you want to test if Turbo is the cause of something you see in the results**). If Turbo lets the CPUs run faster when they're not all busy, shouldn't the results reflect that? Or are you trying to simulate a "busy" cluster where the other cores will be loaded with another job? Just run 24-N simple infinite loops or something. Or http://cpuburnin.com/ to generate a lot of heat, to make it harder for any cores to Turbo. (Not all infinite loops are created equal when it comes to thermal / power considerations). – Peter Cordes Jul 22 '16 at 01:45

1 Answer


You may measure the current CPU clock externally with a profiling tool that has access to the hardware performance monitoring unit and can obtain the real count of CPU cycles used by the program. For example, Linux perf does this in perf stat mode (or perf stat -e cpu-clock,cycles, or perf stat -a -A for system-wide measurement): it prints the mean CPU frequency on the cycles line whenever it counts both the cycles event and some wall-time event such as task-clock or cpu-clock.

We can also compute this internally, inside the program, by reading the wall time (gettimeofday) and the cycles performance counter (but not RDTSC, because most TSCs are now invariant: "invariant TSC is indicated by CPUID.80000007H:EDX[8]"): Mean_Frequency = cycles_consumed / spent_time. We can do this just before and just after the test functions, or around every test function (if the Copy/Scale/Add STREAM tests are enabled) to get a more detailed frequency profile (there are HPC tracing projects capable of recording time and cycles on every MPI call).
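
A minimal sketch of that idea using the Linux perf_events interface directly (Linux-specific; error handling omitted, and the region callback is just a placeholder for one repetition of a STREAM kernel):

  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <sys/time.h>
  #include <linux/perf_event.h>

  /* open a counter for CPU cycles of the calling thread on any CPU */
  static int open_cycles_counter(void)
  {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.type = PERF_TYPE_HARDWARE;
      attr.size = sizeof(attr);
      attr.config = PERF_COUNT_HW_CPU_CYCLES;
      attr.disabled = 1;
      attr.exclude_kernel = 1;
      return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }

  /* run region() and return the mean core frequency in GHz while it ran */
  double mean_ghz(void (*region)(void))
  {
      int fd = open_cycles_counter();
      struct timeval t0, t1;
      long long cycles = 0;

      gettimeofday(&t0, NULL);
      ioctl(fd, PERF_EVENT_IOC_RESET, 0);
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

      region();                          /* e.g. one repetition of the triad loop */

      ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
      gettimeofday(&t1, NULL);
      read(fd, &cycles, sizeof(cycles));
      close(fd);

      double seconds = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
      return cycles / seconds / 1e9;
  }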

Internal reading of CPU cycles (to compute the mean frequency over a program fragment) may be done with a PMU API like perf_events (on Linux; this is what perf uses), perfmon, libpfm3/libpfm4, PAPI, or another PMU access library used in your part of the HPC world. Some PMU APIs are portable between OSes and CPU architectures.

Alternatively, you may try to disable Turbo Boost (https://stackoverflow.com/a/38034503/196561), or to keep it permanently engaged by using water cooling with very cold water (which may not work in a world with 15+ core Xeons).
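
On Linux systems that use the intel_pstate driver, Turbo Boost can be switched off via the no_turbo sysfs knob (requires root; whether your cluster exposes this is an assumption). A minimal C sketch:

  /* sketch: disable Turbo Boost through the intel_pstate sysfs switch
     (requires root and a kernel using the intel_pstate driver) */
  #include <stdio.h>

  int main(void)
  {
      FILE *f = fopen("/sys/devices/system/cpu/intel_pstate/no_turbo", "w");
      if (!f) { perror("no_turbo"); return 1; }
      fputs("1\n", f);                   /* 1 = Turbo Boost off, 0 = on */
      return fclose(f) == 0 ? 0 : 1;
  }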
