[Snapshot of top command output]

A C++ OpenMP program compiled with the Intel compiler is submitted to a cluster node through the SLURM job scheduler, with two different parameter sets in two different directories. The number of threads is 20 in both cases, but one run shows 1700-2000% CPU (which is fine) while the other shows only 500-950% CPU, roughly a third to a half of the first. What is causing this performance difference? The same thing happens when the program runs on different nodes with the same configuration. Here is the part of the code where OpenMP is used:

#define NUMBER_OF_THREADS 20
...
...
void someFunction(){
#pragma omp parallel for num_threads(NUMBER_OF_THREADS) collapse(2)
    for (size_t i = 0; i < NX; i++) {
        for (size_t j = 0; j < NY; j++) {
            // work to be done
        }
    }
}
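
One way to test the thread-pinning hypothesis raised in the comments below is to print where each thread actually executes. Here is a minimal diagnostic sketch, not part of the original program (the helper name reportThreadPlacement is made up), assuming Linux, where sched_getcpu() is available as a GNU extension:

// Hypothetical diagnostic: print the logical CPU each OpenMP thread
// runs on, to verify that the 20 threads land on 20 distinct cores.
#include <omp.h>
#include <sched.h>   // sched_getcpu(), GNU extension
#include <cstdio>

#define NUMBER_OF_THREADS 20   // same setting as in the question

void reportThreadPlacement() {
#pragma omp parallel num_threads(NUMBER_OF_THREADS)
    {
        std::printf("thread %2d -> cpu %2d\n",
                    omp_get_thread_num(), sched_getcpu());
    }
}

If several threads of the slow run report the same CPU number, the 500-950% CPU figure comes from oversubscribed cores rather than from the loop body.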

Here is the SLURM job submission script:

#!/bin/bash
#SBATCH -J name               
#SBATCH -p partitionName
#SBATCH -n 1                       # no of processes
#SBATCH --cpus-per-task=20

module load compiler/intel/2019.5.281

cd my_working_directory
path_of_the_executable
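
If pinning turns out to be the issue, binding can be requested explicitly in the submission script. A sketch under the assumption that the Intel 2019 runtime is used (OMP_PLACES and OMP_PROC_BIND are standard OpenMP 4.0 variables; KMP_AFFINITY=verbose is Intel-specific and makes the runtime log each thread's binding):

#!/bin/bash
#SBATCH -J name
#SBATCH -p partitionName
#SBATCH -n 1                       # no of processes
#SBATCH --cpus-per-task=20

module load compiler/intel/2019.5.281

# Bind one OpenMP thread per core and report the resulting placement.
export OMP_NUM_THREADS=20
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export KMP_AFFINITY=verbose        # Intel-specific: log the actual binding

cd my_working_directory
path_of_the_executable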

Here are the node details:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
– subrata
  • Experience shows that the problem is often in the `work to be done` part you did not provide. If there is a lock, system calls, or anything like that, then this is normal behaviour. It is hard for a user to know, since even memory accesses can cause hidden system calls (inter-process/device shared memory, page faults, etc.). Over-subscription of BLAS is also a frequent issue. This is impossible to tell from the current description. Besides, did you check that the threads are correctly pinned to cores? – Jérôme Richard Jun 04 '22 at 11:20
  • Here, I am doing basic calculations like addition, subtraction, sin, cos, exp on the elements of flattened vectors, and I am not using BLAS. If both runs showed the same but lower CPU usage, then it would obviously be a problem with the `work to be done` part, but that was not the case here. – subrata Jun 04 '22 at 15:17
  • OK, and what about the thread pinning? There is a high chance this is related to that. There are memory accesses, aren't there? If so, the balancing can be (and often is) dependent on the NUMA page placement, with the imbalance due to the default allocation policy (especially if hyper-threads are used, which is not clear). You can check whether the problem comes from this by removing all memory accesses in the parallel loop (at least all non-local ones). You can also tweak the NUMA policy; see the first-touch sketch after these comments. – Jérôme Richard Jun 04 '22 at 17:36
  • [This post](https://stackoverflow.com/questions/62604334/how-do-numactl-perf-change-memory-placement-policy-of-child-processes/62615032#62615032) about `numactl` and the binding of threads may help. By the way, the default Linux scheduler is known not to be reliable when it comes to scheduling threads properly on HPC machines. Consider reading [The Linux Scheduler: a Decade of Wasted Cores](https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf). SLURM and the Intel OpenMP runtime should automatically pin threads to cores, but this sometimes fails due to bugs on some HPC machines. – Jérôme Richard Jun 04 '22 at 17:41
  • Could you provide sample reproducer code and the steps you followed, so that we can reproduce the issue at our end? – HemanthCH Jun 20 '22 at 10:46
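
To illustrate the NUMA point from the comments: under Linux's default first-touch policy, a memory page is placed on the NUMA node of the first thread that writes to it, so on a two-socket node like this one, data initialized sequentially ends up on one node and half of the threads then access remote memory. A sketch of first-touch initialization, assuming a flat array indexed as i * NY + j like the loop in the question (names and sizes are placeholders):

// Sketch only: make pages end up near the threads that will use them.
#include <memory>
#include <cstddef>

constexpr std::size_t NX = 4096, NY = 4096;   // placeholder sizes

std::unique_ptr<double[]> makeNumaFriendlyArray() {
    // new double[n] leaves the doubles uninitialized, so physical pages
    // are not committed yet; the parallel loop below does the first write.
    std::unique_ptr<double[]> a(new double[NX * NY]);

    // Same parallel structure as the compute loop, so each page is first
    // touched by the thread that will later work on it.
#pragma omp parallel for collapse(2) num_threads(20)
    for (std::size_t i = 0; i < NX; i++)
        for (std::size_t j = 0; j < NY; j++)
            a[i * NY + j] = 0.0;

    return a;
}

Note that std::vector<double> v(NX * NY) would zero-fill every element from the constructing thread and place all pages on one node, which is why a raw allocation is used here.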

0 Answers