
The whole program has been shrunk to a simple test:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
#pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

In the first case it takes about 3 seconds to run through the loop; in the second case the result is inconsistent, ranging from 4 to 9 seconds. Both loops run faster with some optimizations enabled (such as favouring speed and whole-program optimization), but the second loop is still significantly slower. I tried adding a barrier at the end of the loop and explicitly marking the array as shared, but that didn't help. The only case where I managed to make the parallel loop run faster was by making the loops empty. What may be the problem?

Windows 10 x64, CPU Intel Core i5 10300H (4 cores)

Kaiyakha
  • The measuring technique is always suspect number one. – Bathsheba Apr 06 '22 at 13:24
  • @Bathsheba it is present in the code, what's wrong with it? – Kaiyakha Apr 06 '22 at 13:25
  • A few suspects as pointers: i) 1e10 is out of `int` range, ii) compilers often optimize out such for loops, so you need to use `volatile int`, iii) the measuring technique does not reflect actual CPU time. – haleyk Apr 06 '22 at 13:28
  • The `k[omp_get_thread_num()]` in your code is causing false sharing, as the cache line that contains the `k` array is being sent back and forth between the caches of the system. If you change the definition of the array to hold 4*16 integers and use `k[omp_get_thread_num() * 16]` instead, the problem should be gone. – Michael Klemm Apr 06 '22 at 13:31
  • @MichaelKlemm That seems to work. But how should I then use the parallel for loop if I want each element to be processed, not every 16th one? – Kaiyakha Apr 06 '22 at 13:42
  • Have each thread process 16 *adjacent* elements at a time. – n. m. could be an AI Apr 06 '22 at 13:50
  • @n.1.8e9-where's-my-sharem. Why 16? Where does this number come from? – Kaiyakha Apr 06 '22 at 13:53
  • A typical cacheline size is 64 bytes. – n. m. could be an AI Apr 06 '22 at 14:05
  • @Kaiyakha: "Why 16? Where does this number come from?" -- Assuming `sizeof(int)==4`, you can store 16 integers in a single cache line (assuming a cache line size of 64 bytes). – Andreas Wenzel Apr 06 '22 at 14:24
  • Yes, the 16 comes from the number of 4-bytes integer values that can be stored per 64-byte cache line. – Michael Klemm Apr 06 '22 at 15:32
  • Note that while doing `*16` in this code should work on most machines, it is not portable. First, there is no guarantee that a cache line is 64 bytes (although this is common). Moreover, `sizeof(int)` is not guaranteed to be 4 (but again this is very common). `alignas(std::hardware_destructive_interference_size)` helps to fix that. Finally, even if both hold, this is not sufficient to ensure there is no false sharing: AFAIK some processors load cache lines in blocks, like Intel ones which load 128 bytes of data. See: https://stackoverflow.com/a/52158819/12939557 (a minimal sketch of the padding/alignment approach follows below). – Jérôme Richard Apr 06 '22 at 16:44
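
Here is a minimal sketch of the padding/alignment idea from the comments above. The `PaddedCounter` struct name, the hard-coded 64-byte alignment, and the timing scaffolding are illustrative assumptions rather than part of the original posts (where available, `std::hardware_destructive_interference_size` from `<new>` would be the more portable alignment value). Note that the indexing mirrors the question's `k[omp_get_thread_num()]`, which, as the answer below points out, does not compute the same result as the sequential `i % 4` loop:

    #include <iostream>
    #include <omp.h>

    // Each per-thread counter sits on its own (assumed 64-byte) cache line, so no
    // two threads ever write to the same line and false sharing is avoided.
    // 64 bytes is a typical cache-line size, not a guarantee; where available,
    // alignas(std::hardware_destructive_interference_size) is the portable choice.
    struct alignas(64) PaddedCounter {
        long long value;
    };

    int main() {
        const long long loops = 10000000000LL;
        PaddedCounter k[4] = { {1}, {2}, {3}, {4} };

        double time = omp_get_wtime();
        #pragma omp parallel for num_threads( 4 )
        for ( long long i = 0; i < loops; i++ )
            k[omp_get_thread_num()].value += 2;   // each thread touches only its own line
        std::cout << "padded: " << omp_get_wtime() - time << std::endl;

        return 0;
    }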

1 Answer


As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is a typical case where one can observe it. However, there are also quite a few issues in your code, such as:

  • You will likely see overflows, both in your `loops` variable and in all of your `j` and `k` arrays;
  • Your timer isn't really the best choice ever (admittedly, that is a bit pedantic on my part in this case);
  • You do not make use of the values you computed, which allows the compiler to completely ignore the various computations;
  • Your two loops are not equivalent and won't give the same results. In order to get them to match, I went back to your original `i % 4` indexing and added a `schedule( static, 1 )` clause. This is not the proper way of doing it, but it was only there to reproduce the expected results without yet using the correct `reduction` clause.

Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;

    // sequential baseline
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
    for ( long long i = 0; i < loops; i++ ) {
        j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    // parallel version exhibiting false sharing: with schedule( static, 1 ) each
    // thread updates only its own element of k, but all 4 elements share a cache line
    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    // same loop with an array-section reduction: each thread accumulates into a
    // private copy of l, so nothing is shared during the loop
    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
    #pragma omp parallel for num_threads( 4 ) reduction( +: l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    // check that all three versions computed the same values
    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

Compiling and running without optimizations gives on my laptop:

$ g++ -O0 -fopenmp false.cc 
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

This speaks for itself in regard to the improvement the `reduction` clause brings. Now, enabling compiler optimizations gives a more mixed picture:

$ g++ -O3 -fopenmp false.cc 
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

If anything, that shows that compilers are quite good at avoiding most false sharing nowadays. Indeed, with your initial (erroneous) `k[omp_get_thread_num()]`, there was no time difference with and without the `reduction` clause, showing that the compiler was able to avoid the issue.

Gilles
  • Thanks for the answer! What about using `(static, 8)` scheduling? Theoretically, each thread is then working with a contiguous block of memory of 64 bytes (presuming `long long` is 8 bytes; I omit the fact that its size may vary for the sake of simplicity here). Should this one work? I'm asking because in my case it didn't – Kaiyakha Apr 07 '22 at 09:49
  • The `schedule( static, 1 )` I added has nothing to do with preventing false sharing. It was there only to ensure that each thread was accessing only the element of the array `k` that corresponded to its own rank. Without it, the default schedule would have been (with my compiler) a simple `static` one, with contiguous chunks of indexes of size `loops/4`. Then, each thread would have accessed all possible indexes of `k` for incrementing. In this case, one would need to protect these accesses between threads with either `atomic` or `critical`, which essentially re-serializes the parallel loop (a rough sketch of that variant follows after these comments) – Gilles Apr 07 '22 at 12:07
  • In your case, unless your actual code differs significantly from your snippet, the `reduction` clause is the way to go – Gilles Apr 07 '22 at 12:08
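
For contrast, here is a rough sketch of the variant described in the comment above: the default contiguous schedule is kept, so every thread hits every index of `k`, and each increment is protected with `atomic`. The timing scaffolding is illustrative and not from the original posts; as noted above, the atomic updates essentially serialize the work, so the `reduction` version in the answer remains the better choice.

    #include <iostream>
    #include <omp.h>

    int main() {
        const long long loops = 10000000000LL;
        long long k[4] = { 1, 2, 3, 4 };

        double time = omp_get_wtime();
        // default (contiguous-chunk) schedule: each thread's chunk still cycles
        // through all four indexes via i % 4, so every update must be protected
        #pragma omp parallel for num_threads( 4 )
        for ( long long i = 0; i < loops; i++ ) {
            #pragma omp atomic
            k[i % 4] += 2;
        }
        std::cout << "atomic: " << omp_get_wtime() - time << std::endl;

        return 0;
    }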