
With g++, I have often achieved effective parallel speed-ups using simple OpenMP annotations, but not with the following near-trivial vectorization example. I am using g++ 7 on Ubuntu 16.04, and understand that OpenMP support comes with the compiler.

#include <iostream>
#include <omp.h>
#include <chrono>
#include <ctime>
#include <random>
#include <cmath>   // std::fabs, used in the result check below

int main() {
    using namespace std;
    using namespace std::chrono;

    const unsigned N = 10000000;
    float* table1 = new float[N];
    float* table2 = new float[N];
    float* table3 = new float[N];

    std::mt19937 RND(123);
    std::uniform_real_distribution<float> dist(0,1);

    for (unsigned n = 0; n < N; ++n) { /*Initialize table1 and table2*/
        table1[n]=dist(RND);
        table2[n]=dist(RND);
    }

    auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    
    for(unsigned k=0;k<500;k++) { /*Do inner loop a lot*/

        //#pragma omp parallel for
        //#pragma omp simd
        for (unsigned n = 0; n < N; ++n) /*VECTORIZE ME*/
        { 
            table3[n]=table1[n]+table2[n];
        }

    }

    auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    std::cout << "Time " << end.count() - start.count() << std::endl;
        
    for (unsigned n = 0; n < N; ++n) { /* Use the result. */
        if (std::fabs(table3[n]-(table1[n]+table2[n]))>0.01) {
            throw false;
        }
    }

    delete[] table1; delete[] table2; delete[] table3;
}

For a baseline, compiling with g++ -o "openmp-sandpit" "openmp-sandpit.cpp" and running yields a time of 14662ms, with top showing 25% (I have an i7 with four cores, and am running top with Irix mode off).

Next, compiling with -O1, -O2 and -O3 achieves 8524ms, 7473ms and 7376ms, respectively, all with top at 25%.

  • Secondary Question #1: Has g++ made use of SIMD vectorization in achieving these optimizations?
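
One way to check this directly, rather than inferring it from timings, would be to ask the compiler for a vectorization report, or to compare against a build with the vectorizer switched off. The flags below are standard g++ options; I have not re-run every build with them:

g++ -O3 -fopt-info-vec -o "openmp-sandpit" "openmp-sandpit.cpp"        # report which loops were vectorized
g++ -O3 -fno-tree-vectorize -o "openmp-sandpit" "openmp-sandpit.cpp"   # comparison build with auto-vectorization disabled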

Next we uncomment #pragma omp parallel for and compile with -fopenmp, achieving 7553ms and a top of 100%. Additionally adding the g++ optimization flags -O1, -O2 and -O3 achieves 8411ms, 7463ms and 7415ms, respectively, all with top just below 100%.

Notice that OpenMP on four cores (top 100%) achieves 7553ms, which is worse than vanilla g++ at -O2 and -O3, and similar to -O1.

  • Secondary Question #2: Why is OpenMP, when using all four cores (top 100%), outperformed by optimized g++ on a single core (top 25%)?

Finally, re-commenting #pragma omp parallel for, uncommenting #pragma omp simd, and compiling with the single option -fopenmp-simd achieves (a terrible) 15006ms with an expected top of 25%. Additionally adding the g++ optimization flags -O1, -O2 and -O3 achieves 7911ms, 7350ms and 7364ms, respectively, all with top at 25%.

  • Main Question: What is wrong with my openmp-simd code? Why is it not vectorizing?
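
For reference, my working assumption about the flags (which may itself be part of the problem) is that -fopenmp-simd only honours the simd pragma and does not by itself enable any optimization, so it presumably needs to sit on top of an -O level, with -march=native added for the wider vector instructions, e.g.

g++ -O3 -march=native -fopenmp-simd -o "openmp-sandpit" "openmp-sandpit.cpp"

Whether that is the whole story is part of what I am asking.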

If I could vectorize the inner n loop (openmp-simd), I could then parallelize the outer k loop (openmp), and should get a 2x-4x speed-up for the outer loop (over the four cores) and a 4x-8x speed-up for the inner loop (SIMD on each core), achieving an 8x-32x overall speed improvement. Surely?
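
For concreteness, the combination I have in mind is sketched below; whether g++ 7 actually vectorizes the inner loop under this arrangement is exactly what I have not yet achieved, so treat it as intent rather than working code:

/* Sketch only: outer repetitions split across the four cores,
   inner loop marked for SIMD on each core. */
#pragma omp parallel for
for (int k = 0; k < 500; ++k) {
    #pragma omp simd
    for (unsigned n = 0; n < N; ++n) {
        table3[n] = table1[n] + table2[n];
    }
}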

[... It appears that g++ auto-vectorization is turned on by default at -O3. This I have tested and verified. ...]

[... The best result is obtained by avoiding openmp-simd. The following code uses OpenMP to split the work across the four cores and relies on g++ auto-vectorization within each core.

#pragma omp parallel for
for (int k = 0; k < 500; k++) { /* Do the inner loop a lot */

    for (int n = 0; n < N; ++n) /* VECTORIZE ME */
    {
        table3[n] = table1[n] + table2[n];
    }

}

Compiling with g++-7 **-O3** **-march=native** **-fopenmp** (thanks to @Marc Glisse for -march=native) yields 3912ms. No other combination comes close. ...]
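
A further variant I have not yet timed, aimed at the overhead behind Secondary Question #2, would enter the parallel region once instead of 500 times and work-share the inner loop on every repetition. A sketch only (the implicit barrier after each work-shared loop remains):

/* Untested sketch: one parallel region for all 500 repetitions;
   each inner loop is split across the thread team, with a barrier at its end. */
#pragma omp parallel
for (unsigned k = 0; k < 500; ++k) {
    #pragma omp for
    for (unsigned n = 0; n < N; ++n) {
        table3[n] = table1[n] + table2[n];
    }
}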

  • Try to change the `N` from unsigned to signed, as well as `n` in your inner loop, when trying to use the OMP `parallel for` pragma; that worked for me under MSVC (accelerated twice from the vanilla version) – kreuzerkrieg Oct 08 '17 at 09:32
  • I tried that for both the openmp and the openmp-simd versions and no change. Thanks for trying. – fundagain Oct 08 '17 at 09:36
  • Don't forget `-march=native` to use non-trivial vector instructions... – Marc Glisse Oct 08 '17 at 10:28
  • #1: `-O3 -fno-tree-vectorize` is slower, so yes it did benefit from vectorization. `-fopt-info` lists the vectorized loops. – Marc Glisse Oct 08 '17 at 10:32
  • @Marc Glisse Thanks. I have just explored -march=native and it is certainly helping with g++ auto vectorization. I can see the change in the assembly instructions. This is leading me to opt for an openmp divide into cores and then g++ auto vectorization per core. I have read that openmp turns off gcc auto vectorization, but I will explore. – fundagain Oct 08 '17 at 10:35
  • #3: Of course you get terrible performance if you don't pass any `-O*` flag; -fopenmp-simd doesn't replace them, it comes on top of them and is useless without them. – Marc Glisse Oct 08 '17 at 10:36
  • Is the answer to #2 that openmp is spreading to 4 cores but now gcc is no longer auto vectorizing? – fundagain Oct 08 '17 at 10:39
  • #2: the overhead (including worse optimization) of using openmp threads is too much, for too little work in each thread. Make N 100 times larger to see some benefit. – Marc Glisse Oct 08 '17 at 10:41
  • omp simd shouldn't have any advantage here. As the arrays are defined locally, there is no aliasing to suppress, either by that pragma or by __restrict (which might often be preferred). It's true that g++ apparently doesn't implement omp parallel simd, but it's hard to draw conclusions when you have that k loop, which the compiler is entitled to eliminate. – tim18 Oct 08 '17 at 20:24
  • The k loop is effective in all cases. I agree it can be removed, but it is not. Hence I can use it for timing. I will change the loop to make it "consequential" and ensure I depend on the "consequence", but suspect the result will be the same. – fundagain Oct 08 '17 at 20:53
  • Secondary Question #1: Use -S to see the assembly code; if `xmm`, `ymm` or `zmm` registers are used for each section, you can tell whether it is SIMDized or not. Secondary Question #2: you must run your program many times and take the smallest time; see this timing [method](https://stackoverflow.com/questions/43579555/how-to-measure-the-elapsead-time-below-nanosecond-for-x86) – Amiri Oct 08 '17 at 23:25
  • Use `-fopt-info-vec-omp-optimized-missed` to get a vectorization report. – Trass3r Jan 12 '18 at 00:04
