
I have a minimal reproducible sample, which is as follows -

#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <vector>
#include <numeric>
#include <cstdlib>   // aligned_alloc



// Elementwise addition of two size x size matrices.
template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}


int main(){
    size_t size = 8192;

    //std::cout<<sizeof(double) * 8<<std::endl;
    

    auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));


    for(int i = 0; i < size * size; i++){
        *(matA + i) = i;
        *(matB + i) = i;
    }

    auto start = std::chrono::high_resolution_clock::now();

    for(int j = 0; j < 500; j++){
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout<<"Average Time is = "<<duration/500<<std::endl;
    std::cout<<*(result + 100)<<"  "<<*(result + 1343)<<std::endl;

}

My experiment is as follows - I time the code with a `#pragma omp for simd` directive on the loop in the AddMatrixOpenMP function, and then time it again without the directive. I compile the code with `g++ -O3 -fopenmp example.cpp`.
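For reference, the variant with the directive just adds the pragma above the loop in AddMatrixOpenMP; nothing else in the code changes:

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    #pragma omp for simd
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}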

Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified the code runs about 3 times slower.
I am not able to understand why.

Edit - I am running GCC 9.3 and OpenMP 4.5, on an i7-9750H (6C/12T) under Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (minor variations between 4.0 and 4.1 GHz).

TIA

  • What happens if you replace `#pragma omp for simd` with `#pragma omp parallel for`? – jjramsey Jun 11 '21 at 13:08
  • @Brannon: Note that `gcc -O2 -fopenmp` doesn't do auto-vectorization by default, *only* for loops where you use `omp simd`. But `gcc -O3` enables `-ftree-vectorize`, which tries to vectorize any/all loops. However, I would have expected the OpenMP vectorizer to do at least as well as the normal vectorizer algorithm for this simple pure vertical SIMD loop. Not obvious what could explain 3x slower. – Peter Cordes Jun 11 '21 at 13:37
  • Thank you for replying @jjramsey. Using `#pragma omp parallel for` it does get parallelized across all my 12 threads. Interesting you brought that up; I remember using `#pragma omp parallel for simd` and it took a little more time than just simd, though I suppose that can be attributed to threading overhead – Atharva Dubey Jun 11 '21 at 13:38
  • What hardware did you test on? (And what OS, in case that matters?) I assume you controlled for "warm up" effects like CPU frequency? You don't touch `result` before the first pass, so it's going to suffer page-faults inside the timed region, but that should be the same for both versions unless you add another timed region reusing the data in the same program. (If you did that, see [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes Jun 11 '21 at 13:43
  • I was going to test this but it doesn't compile. You seem to have left out the `#include`s from your [mcve]. After fixing that and adding the missing `#pragma omp simd`, yeah on i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without `-march=native` or `-fopenmp`), I get 18266, but with `-O3 -fopenmp` I get avg time 39772. – Peter Cordes Jun 11 '21 at 13:43
  • @PeterCordes, YES!! Those are the results I get too! I have updated the question with the details. On a side note, could you please elaborate on that page-fault issue? I am not quite sure what that means. Also, would I be right in expecting a similar runtime for both versions? – Atharva Dubey Jun 11 '21 at 13:51
  • @Brannon, thanks a lot for your reply. As this post suggests - https://stackoverflow.com/questions/61154047/pragma-omp-for-simd-does-not-generate-vector-instructions-in-gcc - at least -O2 must be used for OpenMP. Using -O3 I expect both the data-initialization loop and the addition loop to get vectorized, and when I use the OpenMP pragma the addition loop will always get vectorized. Since both versions output vector instructions (I checked the assembly) I was expecting similar runtimes, but almost a 3-times difference is a lot and I am not sure what exactly is happening underneath. – Atharva Dubey Jun 11 '21 at 13:52
  • I have also experienced several times that autovectorization with some help (such as #pragma ivdep) produces more efficient code than setting up #pragma omp simd directives. My only guess is that OpenMP interferes with the compiler's optimizer... – Laci Jun 11 '21 at 14:00
  • What happens if you don't use the `simd` part, and just use `#pragma omp parallel for`? – jjramsey Jun 11 '21 at 14:01
  • Thank you @Laci for replying. But since both produce vector instructions, shouldn't the times be more or less similar? A little difference here and there can be attributed to different instructions or instruction ordering, but 3 times is a lot of difference. – Atharva Dubey Jun 11 '21 at 14:05

1 Answer


The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this, where the function is large enough that call/ret overhead is minor and constant propagation isn't important, this is a pretty good way to make sure the compiler doesn't hoist work out of the loop.
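A minimal sketch of that change against the function from the question (the #pragma line is only present in the OpenMP-timed variant):

template<typename type>
__attribute__((noinline, noclone))   // stop GCC from inlining/cloning this into main's repeat loop
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    #pragma omp simd                 // only in the OpenMP variant
    for(size_t i = 0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}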

And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
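A quick way to run that check (a sketch that could replace the timed loop in main, assuming matA, matB, result, and size are already set up):

auto time_reps = [&](int reps){
    auto t0 = std::chrono::high_resolution_clock::now();
    for(int j = 0; j < reps; j++)
        AddMatrixOpenMP<float>(matA, matB, result, size);
    auto t1 = std::chrono::high_resolution_clock::now();
    // average microseconds per call of AddMatrixOpenMP
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() / reps;
};
std::cout << time_reps(500) << " vs " << time_reps(1000) << " us per call\n";

If the two per-call times differ wildly, the compiler has probably optimized across the repeat loop.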


After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.

With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)

But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.

So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.

The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.

.L4:                    # outer loop
        movss   xmm0, DWORD PTR [rbx+rdx*4]
        addss   xmm0, DWORD PTR [r13+0+rdx*4]        # one scalar operation
        mov     eax, 500
.L3:                             # do {
        sub     eax, 1                   # empty inner loop after inversion
        jne     .L3              # }while(--i);

        add     rdx, 1
        movss   DWORD PTR [rcx], xmm0
        add     rcx, 4
        cmp     rdx, 67108864
        jne     .L4
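In C++ terms, the inverted code behaves roughly like the following sketch (illustrative only, not GCC's literal output):

for(size_t i = 0; i < size * size; i++){
    float sum = matA[i] + matB[i];    // the real work, done once per element
    for(int j = 0; j < 500; j++){}    // the repeat loop survives only as an empty delay loop
    result[i] = sum;                  // each output element is written exactly once
}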

This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.

Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.

There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
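If you wanted those faults out of the timed region, touching the output once before starting the clock would do it (a sketch; this benchmark doesn't strictly need it):

for(size_t i = 0; i < size * size; i++){
    result[i] = 0.0f;   // fault in every page of the output before the timed region
}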


The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)


Looking at memory bandwidth while the benchmark is running:

Running intel_gpu_top -l while the proper benchmark is running (the OpenMP version, or the one with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.

$ intel_gpu_top -l
 Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
 req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
   0    0        0  97  0.00  20421   7482    0.00   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   3    4       14  99  0.02  19627   6505    0.47   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   7    7       20  98  0.02  19625   6516    0.67   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
  11   10       22  98  0.03  19632   6516    0.65   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   3    4       13  99  0.02  19609   6505    0.46   0   0    0.00   0   0    0.00   0   0    0.00   0   0 

Note the ~19.6 GB/s read / 6.5 GB/s write. Read ~= 3x write since it's not using NT stores for the output stream: each regular store to result first reads the destination cache line (RFO), so every element costs two array reads (matA and matB) plus one RFO read per write.
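For comparison, an NT-store version of the add loop could look something like this sketch (my illustration, not code from the question; it assumes result is 32-byte aligned, size*size is a multiple of 8, and AVX is enabled, e.g. with -march=native):

#include <immintrin.h>

void AddMatrixStream(const float* matA, const float* matB, float* result, size_t size){
    for(size_t i = 0; i < size * size; i += 8){
        __m256 a = _mm256_loadu_ps(matA + i);
        __m256 b = _mm256_loadu_ps(matB + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(a, b));   // NT store: no RFO read of the destination line
    }
    _mm_sfence();   // order the NT stores before any later loads of result
}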

But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.

 Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
 req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
...
   8    8       17  99  0.03    365     85    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   9    9       17  99  0.02    349     90    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   4    4        5 100  0.01    303     63    0.25   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
   7    7       15 100  0.02    345     69    0.43   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
  10   10       21  99  0.03    350     74    0.64   0   0    0.00   0   0    0.00   0   0    0.00   0   0 

vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)
