
We are writing a performance-critical application with three main parameters: N_steps (about 10000), N_nodes (anywhere from 20 to 5000) and N_size (roughly 1k-10k).

The algorithm has essentially this form

for (int i=0; i<N_steps; i++)
{
    serial_function(i); 
    parallel_function(i,N_nodes);
}

where

parallel_function(i,N_nodes) {
    #pragma omp parallel for schedule(static) num_threads(threadNum)
    for (int j=0; j<N_nodes; j++)
    {
        Local_parallel_function(i,j); //complexity proportional to N_size
    }
}

and Local_parallel_function is a function performing linear algebra. It typically runs for about 0.01-0.04 seconds (sometimes more), and this execution time is fairly stable across loop iterations. Unfortunately the problem is sequential in nature, so I cannot restructure the outer loop.
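One restructuring that sometimes reduces fork/join spinning between short parallel regions is to open a single parallel region around the whole time-step loop and run the serial part inside an omp single, so the threads are created once and reused. Below is a minimal sketch of that idea (run_all_steps is just a placeholder name, and serial_function / Local_parallel_function are assumed to keep the signatures used above); whether it actually helps depends on the runtime's wait policy:

void serial_function(int i);                 // defined elsewhere
void Local_parallel_function(int i, int j);  // defined elsewhere

void run_all_steps(int N_steps, int N_nodes, int threadNum)
{
    // One parallel region for the whole computation: threads are spawned once
    // instead of being forked and joined (and spinning) at every step.
    #pragma omp parallel num_threads(threadNum)
    {
        for (int i = 0; i < N_steps; i++)
        {
            // Only one thread runs the sequential part; the implicit barrier
            // at the end of 'single' preserves the step ordering.
            #pragma omp single
            serial_function(i);

            // Same work-sharing loop as in parallel_function above.
            #pragma omp for schedule(static)
            for (int j = 0; j < N_nodes; j++)
            {
                Local_parallel_function(i, j); // cost proportional to N_size
            }
            // Implicit barrier here before the next step's serial part.
        }
    }
}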

While profiling I noticed that a great deal of time is spent in the NtYieldExecution function (up to 20% if I use HT on 4 cores).

I ran some tests, playing with the parameters, and found that this percentage:

  • Increases with the number of threads

  • Decreases as N_nodes and N_size increase.

Most likely the parallel loop does not currently contain enough work for OpenMP, and making it larger or making the function more computationally expensive reduces this overhead.
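If the issue really is that small N_nodes values leave too little work per thread, one pragmatic mitigation is to cap the thread count per step so that each thread always receives a minimum number of node computations. A minimal sketch, where min_chunk is a hypothetical tuning constant to be chosen by measurement and threadNum is passed in explicitly only to keep the snippet self-contained:

#include <algorithm>

void Local_parallel_function(int i, int j);  // defined elsewhere

void parallel_function(int i, int N_nodes, int threadNum)
{
    // Hypothetical tuning constant: the smallest number of
    // Local_parallel_function calls worth giving to one thread.
    const int min_chunk = 8;

    // Never start more threads than there are chunks of useful work, so a
    // small N_nodes does not leave extra threads spinning at the barrier.
    const int usedThreads = std::min(threadNum, std::max(1, N_nodes / min_chunk));

    #pragma omp parallel for schedule(static) num_threads(usedThreads)
    for (int j = 0; j < N_nodes; j++)
    {
        Local_parallel_function(i, j); // complexity proportional to N_size
    }
}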

To get better insight, I downloaded the Intel profiler (VTune) and obtained the following results:

[Results from VTune - 1]
[Results from VTune - 2]

The regions in red are the spin times, and the threads on top are the ones spawned by OpenMP.

Any suggestion on how to manage and reduce this effect?

I use Windows 10, Visual Studio 15.9.5 and OpenMP. Unfortunately the Intel compiler seems unable to compile one dependent library, so I am stuck with the Microsoft one.

  • Did you try not using HT? Usually these hardware threads are not good for compute-intensive applications, where they compete with each other for essential parts of the same physical CPU core. Try to stick to a number of OpenMP threads equal to the number of CPU cores (not HT). And if it improves things, you can even consider disabling HT at the BIOS level. – Gilles Jan 24 '19 at 04:29
  • I tried to measure the speedup from doubling the number of threads: 1->2: 50%, 2->4: 25%, 4->8 (HT): 15%. So it is beneficial, although it already saturates at 4 cores. On average I use around 3 cores. – Enzo Ferrazzano Jan 24 '19 at 06:58
  • Looking at your profiler's output, it seems to me that a lot of the top functions are either wrappers or data copying. That could mean that your code wastes a lot of time reorganizing data in order to call optimized functions, and/or that it is memory bound. Could you try not using OpenMP at compile time and take advantage of MKL's parallelisation (you call DGEMM quite a lot) by setting `MKL_NUM_THREADS` in your environment to the number of threads you want at run time... – Gilles Jan 24 '19 at 07:18
  • At an early stage we tried two strategies: either pack everything into one big problem to be solved via internal parallelism, or write a for loop of smaller, independent problems to parallelise with OpenMP. The second approach seemed faster, and it was definitely easier to debug. Perhaps the first approach will scale better with the number of CPUs; we will give it a try again. – Enzo Ferrazzano Jan 24 '19 at 08:58
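For reference, a minimal sketch of the MKL-threading alternative Gilles mentions in the comments above, assuming the linear algebra inside Local_parallel_function ultimately calls MKL routines such as DGEMM (parallel_function_mkl is just a placeholder name). The node loop stays serial and each call relies on MKL's internal threading instead; equivalently, MKL_NUM_THREADS can be set in the environment before launching the program:

#include <mkl.h>

void Local_parallel_function(int i, int j);  // defined elsewhere, calls MKL internally

void parallel_function_mkl(int i, int N_nodes, int threadNum)
{
    // Hand the threads to MKL instead of to the OpenMP loop; the same effect
    // can be obtained by setting the MKL_NUM_THREADS environment variable.
    mkl_set_num_threads(threadNum);

    // The node loop stays serial; each Local_parallel_function call is
    // parallelised internally by MKL (DGEMM etc.).
    for (int j = 0; j < N_nodes; j++)
    {
        Local_parallel_function(i, j);
    }
}

Whether this beats the current per-node OpenMP loop depends on how large the individual matrices are.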

0 Answers