I am trying to benchmark three different applications. All of them are written in C++ using MPI and OpenMP and are compiled with GCC 7.1 and OpenMPI 3.0. I use a cluster with several nodes; each node has two Intel CPUs and 24 cores. One MPI process runs on each node, and within each node the parallelization is done with OpenMP.
Edit: this is the shortest benchmark; it tests custom reduction operations:
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int process_id = -1;
// 2^28 doubles (2 GiB) initialized to 0.1
std::vector<double> values(268435456, 0.1);

// Custom MPI reduction: element-wise sum over len doubles
void sum(void *in, void *inout, int *len, MPI_Datatype *dptr){
    double* inv = static_cast<double*>(in);
    double* inoutv = static_cast<double*>(inout);
    for(int i = 0; i < *len; ++i){
        inoutv[i] = inoutv[i] + inv[i];
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int mpi_world_size = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

    // Custom OpenMP reduction; private copies start at the identity 0.0
    #pragma omp declare reduction(sum : double : omp_out = omp_out + omp_in) initializer(omp_priv = 0.0)

    MPI_Op sum_mpi_op;
    MPI_Op_create( sum, 0 /* not declared commutative */, &sum_mpi_op );
    double tmp_result = 0.0;
    double result = 0.0;

    std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();

    // Local sum over the vector, using the custom OpenMP reduction
    #pragma omp parallel for simd reduction(sum:tmp_result)
    for(size_t counter = 0; counter < 268435456; ++counter){
        tmp_result = tmp_result + values[counter];
    }

    // Global sum of one double per rank, using the custom MPI operation
    MPI_Allreduce(&tmp_result, &result, 1, MPI_DOUBLE, sum_mpi_op, MPI_COMM_WORLD);

    std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration<double>(timer_end - timer_start).count();

    if(process_id == 0){
        printf("Result: %.5f; Execution time: %.5fs\n", result, seconds);
    }

    MPI_Op_free(&sum_mpi_op);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
For all benchmarks I observe that the execution time alternates between two distinct values. For Benchmark A, out of 10 runs, 5 take about 0.6 s and 5 take about 0.73 s (give or take a little). For Benchmark B it is the same pattern, but the execution time is either 77 s or 85 s, and Benchmark C behaves equivalently. There is nothing in between. I measure the time with std::chrono::high_resolution_clock:
std::chrono::high_resolution_clock::time_point timer_start = std::chrono::high_resolution_clock::now();
// do something
std::chrono::high_resolution_clock::time_point timer_end = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration<double>(timer_end - timer_start).count();
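To rule out the clock itself, this is the pattern I could switch to, with std::chrono::steady_clock and MPI_Wtime side by side and a barrier in front. None of this is in my actual benchmarks; it is only a minimal sketch of the measurement:

#include <mpi.h>
#include <chrono>
#include <cstdio>

void timed_region() {
    // Synchronize all ranks so the timed region starts together
    MPI_Barrier(MPI_COMM_WORLD);
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    double w0 = MPI_Wtime();

    // ... work to be measured ...

    double w1 = MPI_Wtime();
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    double chrono_seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("steady_clock: %.5fs; MPI_Wtime: %.5fs\n", chrono_seconds, w1 - w0);
}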
Slurm is used as the batch system, and I use the --exclusive option to make sure that no other jobs are running on the nodes. The Slurm job script looks essentially like this:
#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 1
#SBATCH --exclusive
#SBATCH --cpus-per-task 24
export OMP_NUM_THREADS=24
RUNS=10
for ((i=1;i<=RUNS;i++)); do
    srun /path/bench_a
done
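For completeness: I do not set any explicit thread or process binding. A variant of the loop with explicit pinning would look roughly like the following (OMP_PLACES, OMP_PROC_BIND and srun's --cpu-bind are standard OpenMP/Slurm options, not part of my current script):

export OMP_NUM_THREADS=24
export OMP_PLACES=cores      # one OpenMP place per physical core
export OMP_PROC_BIND=close   # keep threads close to the master thread

RUNS=10
for ((i=1;i<=RUNS;i++)); do
    srun --cpu-bind=cores /path/bench_a
done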
For building the code I use CMake and set the following flags:
-O3 -DNDEBUG -march=haswell -DMPICH_IGNORE_CXX_SEEK -std=c++14
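Expanded by hand, that corresponds roughly to a compile line like the one below (bench_a.cpp is a placeholder name, and I am assuming that CMake also adds -fopenmp through its OpenMP support; mpicxx is OpenMPI's C++ compiler wrapper):

mpicxx -std=c++14 -O3 -DNDEBUG -march=haswell -DMPICH_IGNORE_CXX_SEEK -fopenmp bench_a.cpp -o bench_a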
Since the behaviour is the same for all three benchmarks, I don't believe the cause lies in their implementations, but rather in the way I build the code or start the jobs.
Do you have any idea what I should look for to explain this behaviour? Thank you.