
I am trying to benchmark three different applications. All of them are written in C++ using MPI and OpenMP and are compiled with GCC 7.1 and OpenMPI 3.0. I use a cluster with several nodes, each with two Intel CPUs and 24 cores in total. One process runs on each node, and within a node parallelization is done with OpenMP.

Edit: This is the shortest benchmark; it tests custom reduction operations:

#include <mpi.h>
#include <omp.h>
#include <vector>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // EXIT_SUCCESS

int process_id = -1;

std::vector<double> values(268435456, 0.1);

// Custom MPI reduction operation: element-wise sum of *len doubles.
void sum(void *in, void *inout, int *len, MPI_Datatype *dptr){
    double* inv = static_cast<double*>(in);
    double* inoutv = static_cast<double*>(inout);
    for(int i = 0; i < *len; ++i){
        inoutv[i] = inoutv[i] + inv[i];
    }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);   
  int mpi_world_size = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &mpi_world_size);   
  MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

  // User-defined OpenMP reduction; each private copy starts from the neutral element 0.0.
  #pragma omp declare reduction(sum : double : omp_out = omp_out + omp_in) initializer(omp_priv = 0.0)

  MPI_Op sum_mpi_op;
  MPI_Op_create(sum, 1 /* the operation is commutative */, &sum_mpi_op);
  double tmp_result = 0.0;
  double result = 0.0;

  std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();

  #pragma omp parallel for simd reduction(sum:tmp_result)
  for(size_t counter = 0; counter < 268435456; ++counter){
    tmp_result = tmp_result + values[counter];
  }     

  // Reduce one double per rank with the custom operation.
  MPI_Allreduce(&tmp_result, &result, 1, MPI_DOUBLE, sum_mpi_op, MPI_COMM_WORLD);
  std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
  double seconds = std::chrono::duration<double>(timer_end - timer_start).count();

  if(process_id == 0){
    printf("Result: %.5f; Execution time: %.5fs\n", result, seconds);
  }

  MPI_Finalize();
  return EXIT_SUCCESS;
}

I observe that the execution time for all benchmarks alternates between two values. For Benchmark A, out of 10 runs, 5 take about 0.6 s and 5 take about 0.73 s (give or take a bit). For Benchmark B the execution time is either 77 s or 85 s (again +/-), and the results for Benchmark C are equivalent. There is nothing in between. I measure the time with std::chrono::high_resolution_clock:

std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();

// do something

std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration<double>(timer_end - timer_start).count();

Slurm is used as the batch system, and I use the exclusive option to make sure that no other jobs run on the nodes. The Slurm job script is basically the following:

 #!/bin/bash
 #SBATCH --ntasks 4
 #SBATCH --nodes 4
 #SBATCH --ntasks-per-node 1
 #SBATCH --exclusive
 #SBATCH --cpus-per-task 24
 export OMP_NUM_THREADS=24
 RUNS=10
 for ((i=1;i<=RUNS;i++)); do
   srun /path/bench_a
 done

For building the code I use CMake and set the flags

-O3 -DNDEBUG -march=haswell -DMPICH_IGNORE_CXX_SEEK -std=c++14
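
With GCC and OpenMPI, these flags correspond roughly to a compile line like the one below. This is only a sketch: mpicxx is OpenMPI's C++ compiler wrapper, -fopenmp is what CMake's OpenMP support would presumably add on top of the listed flags, and the source file name is illustrative.

mpicxx -O3 -DNDEBUG -march=haswell -DMPICH_IGNORE_CXX_SEEK -std=c++14 -fopenmp bench_a.cpp -o bench_a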

Since it is the same for all benchmarks, I don't believe the reason is the implementation, but something about the way I build the code or start the job.

Do you have any idea what I should be looking for to explain that behaviour? Thank you

Fabian
  • There are a million possible reasons for this. It's impossible to answer your question without much more detailed observations and a [mcve]. The best guess would be inhomogeneous systems, e.g. Slurm nodes with different processor frequencies. First step is probably to ask the administrator of the system. – Zulan Apr 09 '18 at 15:41
  • @Zulan Yes, I am aware that this is a very broad question. I added one example of the code. I also thought about the different nodes, but once the job starts, the nodes should stay the same until the job finishes, right? So even if I call srun ... x times, the same nodes should be used all x times. – Fabian Apr 09 '18 at 16:10
  • Can you repro the behaviour on a single node, taking MPI out of the picture? Is there ever any other competing load on your cluster? Have you tried setting your CPU frequency governor (intel_pstate) to "performance" and/or disabling turbo, to control for CPU frequency variation? – Peter Cordes Apr 09 '18 at 19:43
  • Have you tried aligning your array? Maybe the variation is between misaligned vs. 32-byte aligned? But IDK how that would happen the same way across all nodes every time. That's probably not it, anyway; misaligned loads should only matter if your data was hot in L1d or maybe L2 cache, on Haswell. – Peter Cordes Apr 09 '18 at 19:54
  • Do you have hyperthreading enabled? Maybe you're getting unlucky and the last 2 threads to finish are running on the same physical core, slowing each other. Have you profiled to see how much CPU time each thread takes, instead of just recording the time for the slowest thread by waiting for `MPI_Allreduce`? e.g. use `time ./my_program` to see total CPU time if you make a non-MPI version. – Peter Cordes Apr 09 '18 at 19:56
  • Make sure an MPI task is bound to a single socket, otherwise you will experience NUMA effects (and those are not good ones). Then add a second timer to figure out whether the fluctuation occurs in the OpenMP (i.e. CPU) part or the MPI (i.e. communication) part. You should also measure imbalance between the tasks. Keep in mind that the first MPI communication might require some extra time to internally wire up everything, so you should issue at least one dummy `MPI_Allreduce()` before starting the timer (a sketch combining these suggestions follows the comments). – Gilles Gouaillardet Apr 10 '18 at 00:12
  • @GillesGouaillardet: are you saying that using MPI might move your thread to a different core, *after* it allocates / inits its big `vector`? I think the OpenMP stuff is going to start its 24 threads after that happens, and long after the `vector` is allocated / initted (by a single core before `main` runs). So probably the array is all on the local memory of one NUMA node, so half the OpenMP threads are using NUMA remote memory. Unless you get lucky and some physical pages of the array end up on each socket, and the thread scheduling happens to line up... – Peter Cordes Apr 10 '18 at 00:36
  • @PeterCordes binding a hybrid app is a two-step tango. First `srun` binds the MPI tasks to a subset of cores, and then the OpenMP threads are bound by the runtime within the previously assigned subset of cores. My point is that `srun` should bind an MPI task to a single socket instead of spreading it across (half of) the two sockets. MPI does not move threads because it is not even aware of them. – Gilles Gouaillardet Apr 10 '18 at 00:55
  • @GillesGouaillardet: Oh I see what you're saying now. It's been so long since I did anything with MPI that I thought you were saying something about a *network* socket (and that a blocking system call can result in a thread waking up on a different core). Yeah, the OP would get better performance from starting 2 MPI tasks per dual-socket node, one for each socket with `taskset`, instead of letting OpenMP loop over one large array with threads on both sockets. But that doesn't explain the (interesting?) bi-modal performance from having half the threads using NUMA "remote" memory, does it? – Peter Cordes Apr 10 '18 at 01:43
  • @PeterCordes I think this and the MPI wireup can explain the performance fluctuation. – Gilles Gouaillardet Apr 10 '18 at 01:47
  • Another option could be that `srun` (or OpenMP?) performs no binding, so the placement of the OpenMP threads changes between runs. Worst case, two threads from two MPI tasks end up sharing the same core. Bottom line, it is critical to perform optimal binding in order to get the best and most stable performance. `srun --cpu_bind=verbose` can be used to show the MPI binding, and the OpenMP runtime can generally report the thread binding (an adjusted job script sketch also follows the comments). – Gilles Gouaillardet Apr 10 '18 at 01:58
  • @GillesGouaillardet: So what do you think costs the extra 130 ms in the slow case, and why is it bi-modal instead of a continuous range of times? (Fabian: semi-related: [Intel memory-bandwidth info and links](https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug). Different ways of configuring a dual-socket Xeon's memory snoop modes can produce different local vs. remote bandwidth/latency characteristics: https://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v4-product-family-technical-overview) – Peter Cordes Apr 10 '18 at 02:02
  • @PeterCordes I think several ideas were provided, and the OP will double-check and/or update the benchmark, and see how this affects the performances. – Gilles Gouaillardet Apr 10 '18 at 02:08
  • @GillesGouaillardet: ok yeah, I'm with you on that. Further experiments from the OP to narrow down possible causes are basically needed before we can cook up any more specific theories. – Peter Cordes Apr 10 '18 at 02:28
  • I agree with the suggestion that remote memory access within a node is a strong possibility. Memory needs to be "first touched" by the same thread which performs the timed work, which requires that affinity to a single socket be set and that the first access happen in a parallel region with the same structure. – tim18 Apr 10 '18 at 13:01
  • I was surprised at one aspect of this – tim18 Apr 10 '18 at 13:03
  • I wasn't aware that gcc may have actually implemented parallel simd reduction. That aside, the usual default when a socket has no more physical memory at the time of first touch is to attempt remote memory allocation. I actually encountered this as the cause of an effect similar to the one described here. – tim18 Apr 10 '18 at 13:19
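
Putting several of the comment suggestions together, a minimal sketch of a restructured benchmark could look as follows. This is an illustration, not the original code: the array is initialized with a parallel first touch, a dummy MPI_Allreduce is issued before the timed region, and the OpenMP and MPI phases are timed separately. For brevity it uses the built-in MPI_SUM instead of the custom operation, and the variable names are illustrative.

#include <mpi.h>
#include <omp.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <memory>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const size_t n = 268435456;
  // Allocated but not initialized: physical pages are assigned on first write.
  std::unique_ptr<double[]> values(new double[n]);

  // First touch in parallel: each page ends up on the NUMA node of the thread
  // that will later read it (instead of all pages on one socket, as happens
  // when a global vector is filled before main runs).
  #pragma omp parallel for schedule(static)
  for (size_t i = 0; i < n; ++i) {
    values[i] = 0.1;
  }

  // Dummy collective so the first timed MPI call does not pay the
  // connection wire-up cost.
  double warmup_in = 0.0, warmup_out = 0.0;
  MPI_Allreduce(&warmup_in, &warmup_out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  auto t0 = std::chrono::high_resolution_clock::now();

  // Local (OpenMP) part of the reduction.
  double local_sum = 0.0;
  #pragma omp parallel for simd schedule(static) reduction(+:local_sum)
  for (size_t i = 0; i < n; ++i) {
    local_sum += values[i];
  }

  auto t1 = std::chrono::high_resolution_clock::now();

  // Global (MPI) part of the reduction.
  double global_sum = 0.0;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  auto t2 = std::chrono::high_resolution_clock::now();

  if (rank == 0) {
    printf("Result: %.5f; OpenMP part: %.5fs; MPI part: %.5fs\n",
           global_sum,
           std::chrono::duration<double>(t1 - t0).count(),
           std::chrono::duration<double>(t2 - t1).count());
  }

  MPI_Finalize();
  return EXIT_SUCCESS;
}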
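
Along the same lines, the job script could make the binding explicit (one MPI task per socket, binding reported both by srun and by the OpenMP runtime). This is only a sketch: the split into 2 tasks x 12 cores assumes two 12-core sockets per node, and the exact option spellings depend on the installed Slurm and OpenMP versions.

#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 2       # one MPI task per socket (assumes 2 sockets x 12 cores)
#SBATCH --cpus-per-task 12
#SBATCH --exclusive
export OMP_NUM_THREADS=12
export OMP_PROC_BIND=close        # keep each task's threads on its own socket
export OMP_PLACES=cores
export OMP_DISPLAY_ENV=true       # ask the OpenMP runtime to print its settings
RUNS=10
for ((i=1;i<=RUNS;i++)); do
  srun --cpu_bind=verbose,cores /path/bench_a
done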

1 Answer


This is a usual problem in benchmarking: benchmarks are usually run 10,000 times and then averaged. Fluctuations in execution time below 10% are difficult to avoid.

Does your cluster have only 4 nodes? One reason which could explain it is network usage, especially if the communication time is in fact quite a large percentage of the execution time (as you might not be the only one running on the cluster at that time). The only way to avoid this issue is to run your benchmark on the entire cluster or within a reservation of the entire cluster. You should ask the IT staff what they prefer, especially if you have a given budget on the cluster. But usually 10,000 runs give a pretty good estimate of your execution time.

(You are right about multiple srun calls in a given Slurm script; always run all the benchmarks you want to compare in the same script :-))

David Daverio
  • I don't think there's going to be a lot of communication time. Each process allocates / inits its own copy of the `vector`, so the only network communication is MPI init and sending one `double` result, and detecting that all MPI nodes have finished. – Peter Cordes Apr 09 '18 at 20:10
  • I think it could be surprising :-). The question is not the total execution time but the benchmark part. I have been more than once surprised how much time a small message send/recv can take! – David Daverio Apr 09 '18 at 23:05
  • Yeah, I asked the OP if they could repro it on a single node, to rule out MPI altogether. It seems strange and unlikely that network congestion could produce that bi-modal time distribution, though, rather than a range of slowdowns. But CPUs can sometimes do that. – Peter Cordes Apr 09 '18 at 23:07
  • You do not need to run 10,000 times in order to figure out that your app is non-deterministic and/or very sensitive to other jobs running on the cluster. Unless you run into severe congestion, sending a small message virtually returns immediately, and the receiver spends most of the time waiting for the sender (i.e. imbalance at the app level and not pure interconnect performance). – Gilles Gouaillardet Apr 10 '18 at 00:27
  • CUG2018 standard for MPI calls. – David Daverio Apr 10 '18 at 00:41