Check the following code:

#include <stdio.h>
#include <omp.h>

#define ARRAY_SIZE  (1024)
float A[ARRAY_SIZE];
float B[ARRAY_SIZE];
float C[ARRAY_SIZE];

int main(void)
{   
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        A[i] = i * 2.3;
        B[i] = i + 4.6;
    }

    double start = omp_get_wtime();
    for (int loop = 0; loop < 1000000; loop++)
    {
        #pragma omp simd
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            C[i] = A[i] * B[i];
        }
    }
    double end = omp_get_wtime();
    printf("Work consumed %f seconds\n", end - start);
    return 0;
}

When I build and run it on my machine, it outputs:

$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.084107 seconds

If I comment out "#pragma omp simd", build and run it again:

$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.112724 seconds

We can see that "#pragma omp simd" doesn't give a big performance gain. But if I add the -O2 option, without "#pragma omp simd":

$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.446662 seconds

With "#pragma omp simd":

$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126799 seconds

We can see a big improvement. But if I use -O3, without "#pragma omp simd":

$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.127563 seconds

With "#pragma omp simd":

$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126727 seconds

We can see the results are similar again.

Why does "#pragma omp simd" only give a big performance improvement at -O2 under GCC?

    Looks like the compiler further optimizes your code when using -O3 and is likely taking advantage of SIMD instructions. Have you compared the resulting assembly? – Harald Dec 27 '17 at 10:08

1 Answer

Forget about timing with -O0, it's a total waste of time.

gcc -O3 attempts to auto-vectorize all loops, so using OpenMP pragmas only helps for loops that would otherwise only auto-vectorize with -ffast-math, restrict qualifiers, or other workarounds for the correctness obstacles the compiler has to rule out under all possible circumstances before auto-vectorizing pure C. (Apparently there are no such obstacles here: this isn't a reduction, the operations are purely vertical, and you're operating on static arrays, so the compiler can see they don't overlap.)

gcc -O2 does not enable -ftree-vectorize, so you only get auto-vectorization if you use OpenMP pragmas to ask for it on specific loops.


Note that clang enables auto-vectorization at -O2.


GCC's auto-vectorization strategies may differ between OpenMP and vanilla loops. IIRC, for OpenMP loops gcc may just use unaligned loads / stores instead of going scalar until reaching an alignment boundary. This has no performance downside with AVX if the data is aligned at runtime, even if that fact wasn't known at compile time, and it saves a lot of code bloat compared to gcc's massive fully-unrolled startup / cleanup code.

It makes sense that if you're asking for SIMD vectorization with OpenMP, you've probably aligned your data to avoid cache-line splits. But C doesn't make it very convenient to pass along the fact that a pointer to float has more alignment than the width of a float (especially the common case where it usually does, even though the function must still work in the rare cases when it doesn't).

    gcc gives the pragma simd a smaller role than other compilers. It would still make the difference (at -O2/3) in a case where you omit an otherwise required restrict qualifier. – tim18 Dec 27 '17 at 19:44
  • @tim18: yes good point, that's part of the "or something" that applies to integer code, too. – Peter Cordes Dec 27 '17 at 19:45
    I wish GCC and Clang defaulted to at least `-O2` like ICC. I don't even see the point of `-O0` or `-O1`. The few times I used a debugger, the problem only appeared in the optimized code anyway. I have been wanting to ask this for a while, but what's the point of `-O1`? – Z boson Dec 28 '17 at 09:00
  • I have found a case where using OpenMP causes GCC not to vectorize the code well. The code is [here](https://stackoverflow.com/a/43544233/2542702). With Clang there is no problem but to get the best results with GCC I had to move the `kernel` function to a separate object file (compiled without `-fopenmp`). So sometimes OpenMP can actually make the optimization worse. – Z boson Dec 28 '17 at 09:04
    At the very least, `pragma omp simd` is useful to avoid compiler-dependent alignment hints when the alignment is stricter than the datatype's. – Jorge Bellon Jan 25 '18 at 09:51