A C++ OpenMP program compiled with the Intel compiler is submitted to a cluster node through the SLURM job scheduler, with two different parameters in two different directories. The number of threads is 20 in each case, but one instance runs at 1700-2000% CPU (which is fine) while the other runs at only 500-950% CPU, roughly one-third to one-half of the first. What is causing this kind of performance difference? The same thing happens when the program runs on different nodes with the same configuration.

Here is the part of the code where OpenMP is used:
#define NUMBER_OF_THREADS 20
...
...
void someFunction() {
    #pragma omp parallel for num_threads(NUMBER_OF_THREADS) collapse(2)
    for (size_t i = 0; i < NX; i++) {
        for (size_t j = 0; j < NY; j++) {
            // work to be done
        }
    }
}
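For reference, here is a minimal diagnostic sketch (not part of the actual program; reportThreadPlacement is just an illustrative name, assuming Linux/glibc) that I could call once to print how many threads the runtime actually spawns and which CPU each thread lands on:

#include <omp.h>
#include <sched.h>   // sched_getcpu(), Linux/glibc
#include <cstdio>

// Spawns the same number of threads as the main loop and reports
// the CPU each one is currently running on.
void reportThreadPlacement() {
    #pragma omp parallel num_threads(NUMBER_OF_THREADS)
    {
        #pragma omp critical
        std::printf("thread %d of %d is on CPU %d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
}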
Here is the SLURM job submission script:
#!/bin/bash
#SBATCH -J name
#SBATCH -p partitionName
#SBATCH -n 1                 # number of tasks (processes)
#SBATCH --cpus-per-task=20
module load compiler/intel/2019.5.281
cd my_working_directory
path_of_the_executable
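As a side note, to check whether the job actually gets all 20 cores from SLURM (depending on how the cluster is configured, cgroups/cpusets may restrict the process to its allocation), I could run something like the following at program start. This is only a sketch assuming Linux/glibc; reportAffinityMask is just an illustrative name:

#include <sched.h>   // sched_getaffinity(), CPU_COUNT — Linux/glibc
#include <cstdio>

// Prints how many CPUs the process affinity mask contains,
// i.e. how many cores the scheduler allows this process to use.
void reportAffinityMask() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        std::printf("CPUs available to this process: %d\n", CPU_COUNT(&mask));
    }
}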
Here are the node details:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz