
I have been working on a quantum simulation. At each time step, a potential function is calculated, one step of the solver is iterated, and then a series of measurements is taken. These three processes are easily parallelizable, and I've already made sure they don't interfere with each other. Additionally, there is some work that is fairly simple but should not be done in parallel. An outline of the setup is shown below.

omp_set_num_threads(3);      // one thread per section
#pragma omp parallel         // the team is created once, outside the loop
{
    while (notDone) {        // every thread executes the loop
        #pragma omp sections // implicit barrier at the end of the construct
        {
            #pragma omp section
            {
                createPotential();
            }
            #pragma omp section
            {
                iterateWaveFunction();
            }
            #pragma omp section
            {
                takeMeasurements();
            }
        }
        #pragma omp single   // one thread runs this; implicit barrier here too
        {
            doSimpleThings();
        }
    }
}

The code works just fine! I see a speed increase of about 30%, mostly from the measurements running alongside the TDSE solver. However, the program goes from using about 10% CPU (about one thread) to 35% (about three threads). This would make sense if the potential function, the TDSE iterator, and the measurements all took equally long, but they do not. Based on the speed increase, I would expect something on the order of 15% CPU usage.

I have a feeling this has to do with the overhead of running these three threads within the while loop. Replacing

#pragma omp sections

with

#pragma omp parallel sections

(and omitting the two lines just before the loop) changes nothing. Is there a more efficient way to run this setup? I'm not sure whether the threads are constantly being recreated, or whether each thread is holding up an entire core while it waits for the others to finish. If I increase the number of threads from 3 to any other number, the program uses as many resources as it wants (which could be all of the CPU) and gets no performance gain.

Joshua M
  • OpenMP does sometimes let the threads spin for no apparent reason; however, I don't know exactly what causes it. Given that you're just trying to run 3 functions in parallel, you might try the standard library -- I think something like `std::async` could work. You can even make your main thread wait for the three to finish by giving them a return value and binding it to a `std::future` object. – Qubit Aug 08 '18 at 07:36
  • I think there is a barrier at the `single` construct, so the other threads should sleep. Also, I would look into `OMP_WAIT_POLICY`: https://stackoverflow.com/a/12617270/2542702 – Z boson Aug 08 '18 at 07:51
  • Maybe this is an issue with sections and could be a reason to use tasks. Have you tried using tasks instead of sections? Sections seem to me like the old way of doing things, from before tasks were added in OpenMP 3.0. – Z boson Aug 08 '18 at 07:56
  • You could try `#pragma omp sections nowait` and then add a `#pragma omp barrier` if you need one. – Z boson Aug 08 '18 at 08:06
  • Tasks are probably a better fit https://stackoverflow.com/a/13789119/2542702 – Z boson Aug 08 '18 at 08:09
  • With tasks I don't think `omp_set_num_threads` is really helpful either. – Z boson Aug 08 '18 at 08:17
  • Here are two tasks based variants you could try https://godbolt.org/g/pQmNf5 – Z boson Aug 08 '18 at 08:32
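For reference, here is a minimal sketch of the task-based variant suggested in these comments. This is illustrative only -- the godbolt link above contains Z boson's actual versions -- and it assumes the same function and flag names as the question:

#pragma omp parallel
#pragma omp single            // one thread generates the tasks
{
    while (notDone) {
        #pragma omp task
        createPotential();
        #pragma omp task
        iterateWaveFunction();
        #pragma omp task
        takeMeasurements();
        #pragma omp taskwait  // wait for the three child tasks only
        doSimpleThings();
    }
}

Idle threads in the team pick up the tasks, and `taskwait` waits only for the generating thread's child tasks rather than imposing a team-wide barrier. As noted above, setting the environment variable `OMP_WAIT_POLICY=passive` can also make waiting threads sleep instead of spin.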

1 Answer


I've tried many options, including using tasks instead of sections (with the same results) and switching compilers. As suggested by Qubit, I also tried std::async. This was the solution! The CPU usage dropped from about 50% to 30% (this is on a different computer from the original post, so the numbers are different -- roughly a 1.5x performance gain for 1.6x the CPU usage). That is much closer to what I expected for this computer.

For reference, here is the new code outline:

void SimulationManager::runParallel(){
    // Pointers to the three member functions -- note there are no
    // parentheses, since we want the addresses, not the call results.
    // (Requires #include <future>.)
    auto rV = &SimulationManager::createPotential;
    auto rS = &SimulationManager::iterateWaveFunction;
    auto rM = &SimulationManager::takeMeasurements;
    std::future<int> f1, f2, f3;
    while(notDone){
        // Launch the three stages concurrently...
        f1 = std::async(rV, this);
        f2 = std::async(rS, this);
        f3 = std::async(rM, this);
        // ...then block until all three have finished.
        f1.get(); f2.get(); f3.get();
        doSimpleThings();
    }
}

The three original functions are called using std::async, and then I use the future variables f1, f2, and f3 to collect everything back to a single thread and avoid access issues.
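One caveat: with no launch policy, std::async is allowed to defer each call and run it lazily on the thread that later calls get(). If you want to guarantee that the three calls really run concurrently, pass std::launch::async explicitly:

// std::launch::async forces each call to run asynchronously instead of
// letting the implementation defer it until get() is called
f1 = std::async(std::launch::async, rV, this);
f2 = std::async(std::launch::async, rS, this);
f3 = std::async(std::launch::async, rM, this);

Many implementations choose the asynchronous path under the default policy anyway, which would explain why the outline above already shows the expected speedup.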

Joshua M
  • What compiler options did you use? What OS? What compiler? What was the hardware? – Z boson Aug 09 '18 at 09:03
  • This is on Windows 10. I tried the Visual Studio 2017 compiler and also the Intel C++ compiler (integrated with VS). There are a lot of compiler options built into VS; are you looking for specific ones? In terms of hardware, my laptop has an i7-4710HQ, 2.5 GHz -- not sure if any other specs are too important for this case. – Joshua M Aug 09 '18 at 18:20
  • Probably the most important is the OS. You could see different results on Linux (due, e.g., to a different thread scheduler). MSVC's implementation of OpenMP is very old, but ICC's is good, so since you tried ICC I think a compiler bias is less likely. At a more basic level, though, both OpenMP implementations are built on Windows threads rather than pthreads. In any case, your question and answer are interesting. – Z boson Aug 10 '18 at 08:56