
I wanted to run two functions in parallel, so I used an OpenMP parallel region to do this. But my code became much slower as a result. After a little testing I noticed that the parallel region itself was not the problem; it was indeed faster than before. Instead, another part of my code (which I had not changed) became a factor of 50 slower.

After another hour of testing I noticed that the omp parallel for in the second region was causing this behavior. It seems that two consecutive parallel regions slow each other down. I wrote a small test program to verify this:

#include <iostream>
#include <chrono>
#include <thread>
#include <omp.h>
#include <vector>
#include <cmath>

using namespace std::chrono_literals;

inline void func1()
{
    // simulate a cpu function that takes 5ms
    std::this_thread::sleep_for(5ms);
}

inline void func2()
{
    // simulate a cpu function that takes 6ms
    std::this_thread::sleep_for(6ms);
}

int main()
{
    // initialize some vectors to test an omp parallel for pragma
    std::vector<float> vec1(10000);
    std::vector<float> vec2(10000);
    std::vector<float> vec3(10000);
    for(int i = 0; i < 10000; i++)
    {
        vec1[i] = (i+1)/1000.0;
        vec2[i] = i;
        vec3[i] = 0;
    }

    // timings taken via std::chrono
    typedef std::chrono::time_point<std::chrono::high_resolution_clock>
                                          time_point;
    typedef std::chrono::duration<double, std::milli> duration;

    // first block
    std::cout << "serial wait, serial loop" << std::endl;
    for(int k = 0; k < 20; k++)
    {

        time_point start = std::chrono::high_resolution_clock::now();

        func1();
        func2();

        duration time1 = std::chrono::high_resolution_clock::now() - start;
        start = std::chrono::high_resolution_clock::now();

        for(int i = 0; i < 10000; i++)
        {
            vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
        }

        duration time2 = std::chrono::high_resolution_clock::now() - start;

        std::cout << k << " " << time1.count() << " " << time2.count() << std::endl;
    }

    // second block
    std::cout << "parallel wait, serial loop" << std::endl;
    for(int k = 0; k < 20; k++)
    {

        time_point start = std::chrono::high_resolution_clock::now();

        #pragma omp parallel num_threads(2)
        {
            if(omp_get_thread_num() == 0)
            {
                func1();
            }
            else
            {
                func2();
            }
        }

        duration time1 = std::chrono::high_resolution_clock::now() - start;
        start = std::chrono::high_resolution_clock::now();

        for(int i = 0; i < 10000; i++)
        {
            vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
        }

        duration time2 = std::chrono::high_resolution_clock::now() - start;

        std::cout << k << " " << time1.count() << " " << time2.count() << std::endl;
    }

    // third block
    std::cout << "serial wait, parallel loop" << std::endl;
    for(int k = 0; k < 20; k++)
    {

        time_point start = std::chrono::high_resolution_clock::now();

        func1();
        func2();

        duration time1 = std::chrono::high_resolution_clock::now() - start;
        start = std::chrono::high_resolution_clock::now();

        #pragma omp parallel for
        for(int i = 0; i < 10000; i++)
        {
            vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
        }

        duration time2 = std::chrono::high_resolution_clock::now() - start;

        std::cout << k << " " << time1.count() << " " << time2.count() << std::endl;
    }

    // fourth block <-- weird behavior
    std::cout << "parallel wait, parallel loop" << std::endl;
    for(int k = 0; k < 20; k++)
    {

        time_point start = std::chrono::high_resolution_clock::now();

        #pragma omp parallel num_threads(2)
        {
            if(omp_get_thread_num() == 0)
            {
                func1();
            }
            else
            {
                func2();
            }
        }

        duration time1 = std::chrono::high_resolution_clock::now() - start;
        start = std::chrono::high_resolution_clock::now();

        #pragma omp parallel for
        for(int i = 0; i < 10000; i++)
        {
            vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
        }

        duration time2 = std::chrono::high_resolution_clock::now() - start;

        std::cout << k << " " << time1.count() << " " << time2.count() << std::endl;
    }    
}

If I run this, I get the following console output:

serial wait, serial loop
0 11.8541 3.23881
1 11.4908 3.18409
2 11.8729 3.12847
3 11.6656 3.19606
4 11.8484 3.14534
5 11.863 3.20833
6 11.8331 3.13007
7 11.8351 3.20697
8 11.8337 3.14418
9 11.8361 3.21004
10 11.833 3.12995
11 11.8349 3.14703
12 11.8341 3.1457
13 11.8324 3.14509
14 11.8339 3.12721
15 11.8382 3.14233
16 11.8368 3.14509
17 11.8335 3.14625
18 11.832 3.15115
19 11.8341 3.14499
parallel wait, serial loop
0 6.59906 3.14325
1 6.42459 3.14945
2 6.42381 3.13722
3 6.43271 3.19783
4 6.42408 3.12781
5 6.42404 3.14482
6 6.42534 3.20757
7 6.42392 3.14144
8 6.425 3.14805
9 6.42331 3.1312
10 6.4228 3.14783
11 6.42556 3.15106
12 6.42523 3.14562
13 6.42523 3.14605
14 6.42399 3.12967
15 6.42273 3.14699
16 6.42276 3.15026
17 6.42471 3.14164
18 6.42302 3.14701
19 6.42483 3.19149
serial wait, parallel loop
0 11.8319 4.51681
1 11.4756 0.928738
2 11.1129 0.221045
3 11.1075 0.220827
4 11.1081 0.220197
5 11.1065 0.218774
6 11.1059 0.218329
7 11.1658 0.218804
8 11.1063 0.218056
9 11.107 0.21789
10 11.108 0.218605
11 11.1059 0.217867
12 11.1218 0.218198
13 11.1059 0.217666
14 11.1056 0.219443
15 11.1064 0.217653
16 11.106 0.21729
17 11.1064 0.217565
18 11.1085 0.217965
19 11.1056 0.21735
parallel wait, parallel loop
0 6.41053 6.92563
1 6.06954 4.88433
2 6.4147 0.948097
3 6.41245 5.95226
4 6.41169 4.20988
5 6.41415 3.34145
6 6.41655 4.26902
7 6.41321 1.80355
8 6.41332 1.53747
9 6.41386 1.5394
10 6.06738 1.88866
11 6.41286 1.531
12 6.4133 1.53643
13 6.41356 6.40577
14 6.70144 3.48257
15 6.41551 3.60291
16 6.39516 4.44704
17 6.92893 0.981749
18 6.41533 1.50914
19 6.41685 8.36792

The first three blocks of output are as one would expect: the serial wait for 5 ms and 6 ms needs about 11 ms, and the vector computation takes about 3.1 ms. If I parallelize the two waits, they take about as long as the slower of the two (6 ms). And the (12-thread) parallel for loop takes about 0.22 ms.

But the fourth block, where I use both parallel regions consecutively, seems very strange. The timing of my loop appears almost random and is much slower (even in the best case) than in the third block. It reflects the behavior I described above for my non-test code.

Can anyone explain to me why this happens and how to solve it?

I compile using:

g++ main.cpp -O3 -fopenmp

I also tested this code on a Windows PC and saw the same behavior.

If I delete the num_threads(2) in the first parallel region, the problem is less severe but still apparent. (But num_threads(2) should only affect the first region, shouldn't it?)
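
A quick way to check how many threads the loop actually gets would be to print the team size from inside it. Just a sketch (the loop_threads variable only exists for this check, it is not part of the timed code above):

int loop_threads = 0;
#pragma omp parallel for
for(int i = 0; i < 10000; i++)
{
    if(i == 0)
    {
        // only the thread that owns i == 0 writes this, so no race
        loop_threads = omp_get_num_threads();
    }
    vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
}
std::cout << "threads in parallel for: " << loop_threads << std::endl;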

Thanks in advance.

Catrexis
  • Dunno about your problem, but using [sections](https://bisqwit.iki.fi/story/howto/openmp/#Sections) instead of that thread num stuff would clean up your code a lot and be more idiomatic. – Shawn Oct 15 '18 at 16:06
  • *"It seems that two consecutive parallel regions slow down each other..."* is kind of interesting. I wonder how many threads are being used for third and fourth blocks (the `parallel for`). How many cores does your CPU have? Does `OMP_NUM_THREADS=2` make a difference? – jww Oct 15 '18 at 16:24
  • You should also consider parallelizing the first data setting routine, due to first touch as explained [here](https://www.nersc.gov/users/computational-systems/cori/application-porting-and-performance/improving-openmp-scaling#toc-anchor-3). – Timothy Brown Oct 15 '18 at 16:39
  • Thank you for the excellent [mcve] and description of your observation. As this is pretty low-level, you could further improve it by specifying your CPU model, gcc version, and whether you have any `OMP` environment variables set. – Zulan Oct 15 '18 at 16:41

2 Answers


I have some trouble consistently reproducing this, but I believe this is what's happening.

Libgomp (gcc's OpenMP runtime) uses a thread pool. As long as you only have successive parallel loops, no new threads are spawned. If, however, you alternate between the parallel for (which uses 12 threads) and the parallel num_threads(2), libgomp decides to shrink the thread pool down to 2 and then blow it up to 12 again.

I verified this by printing gettid() of OpenMP threads #1 and #2 in the parallel loop. While #1 keeps its id, #2 gets a new one in each iteration. As you noticed yourself, you can easily fix this. As noted by Shawn, parallel sections are the more idiomatic solution anyway.
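
A minimal sketch of that check, assuming Linux: it uses syscall(SYS_gettid) (from <sys/syscall.h> and <unistd.h>), since the plain gettid() wrapper needs a fairly recent glibc. Dropped next to the timed loop in the fourth block, it shows whether the same OS threads come back in every iteration:

// add #include <sys/syscall.h> and <unistd.h> at the top of the file
#pragma omp parallel
{
    #pragma omp critical
    std::cout << "omp thread " << omp_get_thread_num()
              << " is OS thread " << syscall(SYS_gettid) << std::endl;
}

And a sketch of the sections variant for the wait part; it removes the manual thread-number branching:

#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    func1();

    #pragma omp section
    func2();
}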

As for the remaining difference, I can't currently reproduce it; I'm getting very noisy results. You could try pinning the threads to CPUs. Otherwise, if such small time periods are involved, you may have to keep the parallel region alive by moving it outside of the code that invokes the two regions, while ensuring correct control flow.
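
For pinning, the standard OpenMP environment variables should be enough; for example (assuming the a.out produced by the compile line in the question):

OMP_PROC_BIND=true OMP_PLACES=cores ./a.out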

Zulan

As Zulan pointed out, the "solution" is to keep the parallel region alive:

// fifth block <-- "solution"
std::cout << "parallel wait, parallel loop" << std::endl;
for(int k = 0; k < 20; k++)
{

    time_point start = std::chrono::high_resolution_clock::now();

    #pragma omp parallel
    {
        if(omp_get_thread_num() == 0)
        {
            func1();
        }
        else if(omp_get_thread_num() == 1)
        {
            func2();
        }

        #pragma omp for
        for(int i = 0; i < 10000; i++)
        {
            vec3[i] = sqrt(sin(pow(vec1[i],vec2[i]))*sin(0.5*pow(vec1[i],vec2[i])));
        }
    }
    duration time1 = std::chrono::high_resolution_clock::now() - start;

    std::cout << k << " " << time1.count() << std::endl;
} 

In this way, each iteration is even faster than the sum of the parallel wait (2nd block) and the parallel loop (3rd block) timings (~6.3 ms). Unfortunately, this solution does not work in my real application, but I will start another question for that problem.

I also noticed that the problem only occurs when hyperthreading is used. My CPU has 6 cores and supports 12 threads via hyperthreading. If I run the test code with OMP_NUM_THREADS=6, the strange behavior in the 4th block is gone.
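
For reference, that run is simply (again assuming the a.out from the compile line in the question):

OMP_NUM_THREADS=6 ./a.out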

Catrexis