
I have two unrelated for loops, one executed serially and one executed with an OpenMP parallel for construct.

The serial loop becomes slower the more OpenMP threads I use.

#include <chrono>
#include <iostream>
#include <random>
#include <vector>
#include <omp.h>

class Foo {
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }

    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };

        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }

    void do_parallel_work() {
#pragma omp parallel for
        // OpenMP 2.0 (all MSVC supports) requires a signed loop index
        for (int i = 0; i < static_cast<int>(parallel_vector.size()); ++i) {
            for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }

private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};

void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);

    Foo foo{ size };

    long long total_dur_1 = 0;
    long long total_dur_2 = 0;

    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();
        
        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();

        const auto tp_3 = std::chrono::high_resolution_clock::now();
        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();

        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }

    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}

int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);

    return 0;
}

The slowdown happens on my local machine, a Win10 laptop with an Intel Core i7-7700 (4 cores, hyperthreading) and 24 GB of RAM. The compiler is the latest MSVC in Visual Studio 2019, building the CMake RelWithDebInfo configuration (includes /O2 and /openmp).

It does not happen on a stronger machine, a CentOS 8 system with 2x Intel Xeon Platinum 9242 (48 cores each, no hyperthreading) and 769 GB of RAM. The compiler is gcc 8.3.1, invoked as g++ --std=c++17 -O3 -fopenmp.

Timings on Win10 i7-7700:

Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775

and on CentOS 8, 2x Xeon Platinum 9242:

Testing with 1 and size: 100000
727756  4111363
Testing with 2 and size: 100000
731649  2069257
Testing with 4 and size: 100000
734019  1056157
Testing with 8 and size: 100000
752584  544373

So my initial thought was "there's too much pressure on the cache". However, when I removed virtually everything from the parallel section except the loop itself, the slowdown still occurred.


Updated parallel section with the work taken out:

    void do_parallel_work() {
#pragma omp parallel for
        for (auto i = 0; i < 8; ++i) {
            //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
            //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            //}
        }
    }

Timings on Win10 with updated parallel section:

Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797

Looking into the OpenMP 2.0 standard (VS only supports 2.0; find it here: https://www.openmp.org/specifications/), section 2.7.2.5, lines 7-8 say:

In the absence of an explicit default clause, the default behavior is the same as if the default(shared) were specified.

And in 2.7.2.4 line 30:

All threads within the team access the same storage area for shared variables.

For me, this rules out the possibility that the OpenMP threads each copy serial_vector, which was the last explanation I could think of.

I'm happy for any explanation/ discussion on that matter, even if I just plainly missed something.

EDIT:

Out of curiosity's sake, I also tested on my Win10 machine under WSL, which runs gcc 9.3.0. The timings are:

Testing with 1 and size: 100000
833678  2752
Testing with 2 and size: 100000
762877  1863
Testing with 4 and size: 100000
816440  1860
Testing with 8 and size: 100000
991184  2350

I'm honestly not sure why the Windows executable takes so much longer on the same machine than the Linux one (/O2 is the maximum optimization level for VC++), but funnily enough, the same artifacts don't appear here.

  • optimizations turned on? Please include the options you used to compile – 463035818_is_not_an_ai Feb 15 '21 at 10:30
  • How long are you actual timed regions? Short enough to be affected by the fact that all-cores max turbo is lower than 1-core max turbo? (With 2-core max turbo maybe somewhere in between.) Note that Intel "client" (non-server) chips like i7-7700 share the same clock frequency for all cores, but "server" chips don't (so each core would have to ramp up to its own max turbo individually). On Skylake-derived chips with hardware p-state management, the decision to change frequency can take only a few microseconds, depending on EPP (energy performance preference), or as much as a millisecond. – Peter Cordes Feb 15 '21 at 10:34
  • Also have a look at the common pitfalls (like page faults when you first touch memory) and other warm-up effects in [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) to see if any of that applies. Looks like you're resizing your `std::vector`s in the constructor, outside the timed region, so getting the page faults out of the way. (Normally that would be dumb vs. arranging to let the RNG be the first write without wasting time writing zeros or copying as you grow, wrestling C++ into submission, but for timing you want this.) – Peter Cordes Feb 15 '21 at 10:38
  • omp has internal spinlocks, on msvc for 200ms, which can affect the serial loop a bit. Have you tried switching OMP_WAIT_POLICY to passive? (Define the environment variable OMP_WAIT_POLICY with value PASSIVE.) Make sure to restart Visual Studio after this change. – AdamF Feb 15 '21 at 10:39
  • @largest_prime_is_463035818 I added the compile flags. I forgot to optimize the CentOS version, which doesn't change the issue at hand. – Fabian Feb 15 '21 at 10:39
  • @PeterCordes Can you elaborate how your first comment is applicable? If there's no work to be done in the parallel region, how does that effect the serial timings? – Fabian Feb 15 '21 at 10:43
  • @AdamF: How exactly would OpenMP spinlocks affect the serial loop? Are you saying OMP worker threads would still be spinning waiting for work from the last parallel loop, because it wouldn't bother to tell them to stop waiting when exiting the `#pragma omp parallel for`? (perhaps optimistically hoping it will enter another parallel loop soon). And that the OP should try running a serial loop *after* the 8-thread parallel test as a worst case? – Peter Cordes Feb 15 '21 at 10:45
  • @PeterCordes Regarding the second comment: Doing the same work once before (without timing) does not change anything, and the cache should be filled by then, shouldn't it? – Fabian Feb 15 '21 at 10:46
  • Cache isn't "full" or "empty"; the question is *what* it's caching. (If you mean it has lots of dirty data needing write-back, that's possible). I notice that `serial_vector.resize(size, 0.0);` in the constructor happens right before the first serial_work loop, so you can expect your data to be hot in L3 cache for the first serial work. But also for the next serial work because your vectors are only 0.76MiB large, so they exceed 256k L2 and easily fit in 8M L3. If anything you'd expect an effect on the Xeon Platinum (with its 1MiB private L2 caches but slower L3 than your i7). – Peter Cordes Feb 15 '21 at 10:51
  • @PeterCordes exactly. After leaving the omp parallel for, the threads are still spinning (on MSVC, active mode is the default and the spin time is 200ms). They use mechanisms like Sleep(0) or something similar, but I have still sometimes observed a performance impact on serial loops. With the most recent MSVC 2019 Preview you can test the LLVM omp runtime ( https://devblogs.microsoft.com/cppblog/improved-openmp-support-for-cpp-in-visual-studio/ ) – AdamF Feb 15 '21 at 10:51
  • Just FYI, your "parallel" work is entirely bottlenecked on divider throughput, unless you compile with `-ffast-math` or manually turn `/ 30.0` into `* (1.0/30)`. Also with `taskset -c 3` to force all threads onto the same core, the effect disappears. But I can repro it on my i7-6700k desktop without that (`g++ -O3 -march=native -fopenmp` GCC 10.2), with EPP = `balance_performance` so max turbo = 3.9GHz for single or all cores, on this motherboard, ruling out turbo up/down shift costs. But I only see a spike after the cores=4 case, flat serial times for the first 3 tests. – Peter Cordes Feb 15 '21 at 11:01
  • @Xenikh - your edit should not have been accepted. In future edits, don't use `code formatting` for names and phrases that aren't code. If you want to emphasize something, use bold or italics. – Peter Cordes Feb 16 '21 at 02:32
  • @AdamF Are you sure you can set the waiting policy with MSVC? The variable is not listed among [the ones understood by the runtime](https://learn.microsoft.com/en-us/cpp/parallel/openmp/reference/openmp-environment-variables?view=msvc-160). Besides, `OMP_WAIT_POLICY` was introduced in OpenMP 3.0 and the one MSVC supports is 2.0 with backported SIMD extensions. – Hristo Iliev Feb 16 '21 at 09:11
  • @HristoIliev yes, I'm sure. I have used it many times. Source: https://support.microsoft.com/en-us/topic/redistributable-package-fix-high-cpu-usage-when-you-run-a-visual-c-2010-application-built-together-with-the-openmp-option-enabled-in-visual-studio-2010-f5b2cde4-93e3-f6a2-3dbb-c26652691058 – AdamF Feb 16 '21 at 09:31
  • Today, I had more time to debug it and @AdamF is completely correct, that fixed the issues. If you post that as an answer, I'm happy to accept it! – Fabian Feb 16 '21 at 16:42

1 Answer


OpenMP on Windows has 200ms spinlocks by default. That means when you leave the omp block, all omp worker threads keep spinning, waiting for new work. This is beneficial if you have many omp blocks next to each other; in your case, the spinning threads just consume CPU power.

To disable/control the spinlocks you have several options:

  1. Define the environment variable OMP_WAIT_POLICY and set it to PASSIVE to disable the spinlocks completely.
  2. Switch to the Intel OMP runtime shipped with oneAPI. Then you can fully control the spin time by defining the KMP_BLOCKTIME environment variable.
  3. Install Visual Studio 2019 Preview (soon this should be in the official release) and use the LLVM omp runtime. Then you can also control the spin time by defining the KMP_BLOCKTIME environment variable.
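
For reference, a sketch of how the variables from the options above might be set (restart Visual Studio, or open a fresh shell, so the child process actually picks up the change):

```shell
# Windows (cmd): disable the OpenMP spin-wait entirely for processes
# started from this shell
set OMP_WAIT_POLICY=PASSIVE

# Linux / WSL (bash) equivalent
export OMP_WAIT_POLICY=PASSIVE

# Intel / LLVM OpenMP runtimes: spin for 0 ms before threads go to sleep
set KMP_BLOCKTIME=0
```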
AdamF