
I'm having trouble with OpenMP on macOS, using the default compiler with libomp as the OpenMP runtime.

More precisely, when I trace the CPU usage of the following code, I would expect to see two cores at 100% (the sections are independent and run in separate threads).

On Linux this works perfectly: I get 200% CPU usage for my process.

On Mac, Instruments traces out a very strange behavior: there is a peak at the very beginning at ~1000% CPU usage (my machine is a dual-socket 6-core Intel Xeon E5, i.e. ~12 cores) with a lot of threads being created, followed by a plateau at 200% (as expected).

The thing is that this peak/warmup completely kills my performance when I iterate over parallel sections.

Does anyone have an explanation for this?

Edit: I have updated the code to clarify the problem and added a snapshot of the execution behavior. Execution profile using macOS Instruments: warmup at ~1000%, a plateau at 200%, then a smaller one at 100% (function_2 is slightly slower than function_1), repeated twice.

#include <iostream>
#include <algorithm>
#include <vector>
#include <cmath>

auto N = 5000;
std::vector<int> vA(N),vB(N);
void function_1()
{
  for (int k = 0; k != 3; k++)
  {
    std::cout << "Function 1 (k = " << k << ")" << std::endl;
    for(auto i=0; i < N ; ++i)
      for(auto j=0; j < N ; ++j)
        vA[j] += i + cos(j); // Doing something meaningless
  }
}

void function_2()
{
  for (int k = 0; k != 4; k++)
  {
    std::cout << "Function 2 " << "(k = " << k << ")" << std::endl;
    for(auto i=0; i < N ; ++i)
      for(auto j=0; j < N ; ++j)
        vB[j] += i + sin(j); // Doing something meaningless
  }
}

int main()
{
  for(auto y = 0; y < 2 ; ++y)
  {
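    // Each iteration enters a new parallel region with two independent sections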
#pragma omp parallel sections
   {
#pragma omp section
    function_1();

#pragma omp section
    function_2();
   }
  }
  return 0;
}
dcoeurjo
  • Even though this is probably not your main issue, don't use `rand` in parallel: https://stackoverflow.com/questions/10624755/openmp-program-is-slower-than-sequential-one I strongly recommend working on reducing your issue to a [mcve] and including more specific performance measurements rather than your generic observation... A usage of `10000%` (100 fully active threads) makes no sense on a 12-core system. – Zulan Sep 02 '19 at 16:26
  • ;) Agreed... I knew about rand(), and the ~10000% was a typo. Thx – dcoeurjo Sep 03 '19 at 13:54

1 Answer


A few things I see in your code that I think need fixing:

  1. rand() should never be used in parallel: the function uses a global internal state, so either it is not thread-safe and you will get more or less anything as results (including possible repetitions of your numbers across threads), or thread safety is silently enforced by means of a mutex or equivalent, in which case accesses are serialized and performance is appalling. Bottom line: find a thread-safe alternative where the state is explicitly passed as a parameter to the function, such as rand_r() (see the sketch after this list)
  2. Define the number of threads to use at run time with the OMP_NUM_THREADS environment variable instead of letting the OpenMP runtime spawn as many threads as there are cores in your machine by default. For example, run OMP_NUM_THREADS=2 ./mycode instead of ./mycode (assuming mycode is the name of your binary); a source-level equivalent is sketched below this list
  3. Be careful that in both of your functions you're reusing i and j from the outermost loops but shadowing them in the inner ones. Here that's not really an issue, but it's very bad practice which I recommend fixing
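
For point 1, here is a minimal sketch of what a thread-safe replacement could look like. Note that rand_r() is POSIX rather than standard C++, and fill_random() and the seed values are made up for illustration:

#include <cstdlib>  // rand_r() (POSIX, not standard C++)
#include <vector>
#include <omp.h>    // omp_get_thread_num()

void fill_random(std::vector<int>& v)
{
#pragma omp parallel
  {
    // Each thread owns its own seed: no shared state, no serialization
    unsigned seed = 42u + omp_get_thread_num();
#pragma omp for
    for (long i = 0; i < (long)v.size(); ++i)
      v[i] = rand_r(&seed);
  }
}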
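
For point 2, if you would rather fix the team size in the source than in the environment, the num_threads clause on the parallel construct achieves the same effect. A sketch that drops straight into the loop in main() from the question:

// Request a team of 2 threads, matching the 2 sections
#pragma omp parallel sections num_threads(2)
  {
#pragma omp section
    function_1();

#pragma omp section
    function_2();
  }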
Gilles
  • For 1 and 3 I agree; the functions' code was just there to perform random, meaningless computations to illustrate the problem (though I'll update the code to make that clear). Concerning 2), controlling the number of threads at execution time using the env variable just hides the problem if I limit the number of threads. IMHO that does not solve the issue I'm having. – dcoeurjo Sep 03 '19 at 13:58
  • You say limiting the number of threads "hides the problem" but doesn't solve it... So what would a solution be? Something that keeps your machine from going berserk when starting a parallel region while still permitting a decent speed-up? Then doesn't defining the number of threads to use serve exactly that purpose? And did you try it? Anyway, if that doesn't look like an acceptable solution to you, I guess you need to change the OpenMP implementation on your Mac for one which better manages the spawning and scheduling of threads. – Gilles Sep 03 '19 at 15:23
  • My point was to understand the behavior of parallel sections in the `libomp` implementation on Mac, by first asking whether someone has had similar issues. Apparently the issue also occurs with the default OpenMP implementation of the Windows Visual Studio compiler: each `#pragma omp parallel sections` launches a lot of threads, then shuts them down to use only the correct number of threads (i.e. the number of `section`s). I haven't tested gcc's implementation, but this puzzles me. – dcoeurjo Sep 03 '19 at 15:47
  • In my original post I was just asking for an explanation ;). I would definitely expect any OpenMP implementation to spawn only the number of threads that corresponds to the (explicit and static) number of sections, but that is not the case and I don't understand why. – dcoeurjo Sep 03 '19 at 15:49