With g++, I have often achieved effective parallel speed-ups using simple openmp annotations, but not with the following near-trivial vectorization example. I am using g++ 7 on Ubuntu 16.04, and understand that openmp comes with the compiler.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <ctime>
#include <cmath>
#include <random>

int main() {
    using namespace std;
    using namespace std::chrono;
    const unsigned N = 10000000;
    float* table1 = new float[N];
    float* table2 = new float[N];
    float* table3 = new float[N];
    std::mt19937 RND(123);
    std::uniform_real_distribution<float> dist(0, 1);
    for (unsigned n = 0; n < N; ++n) { /*Initialize table1 and table2*/
        table1[n] = dist(RND);
        table2[n] = dist(RND);
    }
    auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    for (unsigned k = 0; k < 500; k++) { /*Do inner loop a lot*/
        //#pragma omp parallel for
        //#pragma omp simd
        for (unsigned n = 0; n < N; ++n) /*VECTORIZE ME*/
        {
            table3[n] = table1[n] + table2[n];
        }
    }
    auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
    std::cout << "Time " << end.count() - start.count() << std::endl;
    for (unsigned n = 0; n < N; ++n) { /*Use the result*/
        if (std::fabs(table3[n] - (table1[n] + table2[n])) > 0.01f) {
            throw false;
        }
    }
    delete[] table1; delete[] table2; delete[] table3; /*new[] requires delete[]*/
}
For a baseline, compiling with `g++ -o "openmp-sandpit" "openmp-sandpit.cpp"` and running yields a time of 14662ms, with `top` at 25% (I have an i7 with four processors, and am running `top` with Irix mode off). Next we invoke `-O1`, `-O2` and `-O3`, achieving 8524ms, 7473ms and 7376ms respectively, all with `top` at 25%.
- Secondary Question #1 Has g++ made use of SIMD vectorization in achieving these optimizations?
Next we uncomment `#pragma omp parallel for` and compile with `-fopenmp`, achieving 7553ms and a `top` of 100%. Additionally adding the g++ optimization flags `-O1`, `-O2` and `-O3` achieves 8411ms, 7463ms and 7415ms respectively, all with `top` just below 100%.
Notice that openmp on four cores (`top` 100%) achieves 7553ms, which is worse than vanilla g++ at `-O2` and `-O3`, and similar to `-O1`.
- Secondary Question #2 Why is openmp, when using all four cores (`top` 100%), outperformed by optimized g++ on a single core (`top` 25%)?
Finally, re-commenting `#pragma omp parallel for`, uncommenting `#pragma omp simd` and compiling with the single option `-fopenmp-simd` achieves (a terrible) 15006ms with an expected `top` of 25%. Additionally adding the g++ optimization flags `-O1`, `-O2` and `-O3` achieves 7911ms, 7350ms and 7364ms respectively, all with `top` of 25%.
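For clarity, this is the loop nest as compiled in that configuration, i.e. the listing above with only the `simd` pragma left uncommented:

    for (unsigned k = 0; k < 500; k++) { /*Do inner loop a lot*/
        #pragma omp simd
        for (unsigned n = 0; n < N; ++n) /*VECTORIZE ME*/
        {
            table3[n] = table1[n] + table2[n];
        }
    }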
- Main Question What is wrong with my openmp-simd code? Why is it not vectorizing?
If I could vectorize the inner `n` loop (openmp-simd), I could then parallelize the outer `k` loop (openmp), and should get a 2x-4x speed-up for the outer loop (over the four cores) and a 4x-8x speed-up for the inner loop (SIMD on each core), achieving an 8x-32x speed improvement overall. Surely?
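For concreteness, a minimal sketch of the combined structure I have in mind, assuming the two pragmas can simply be nested and the whole thing is built with `-fopenmp` (this is the intent, not something I have benchmarked):

    #pragma omp parallel for
    for (unsigned k = 0; k < 500; k++) { /*split the 500 outer iterations across cores*/
        #pragma omp simd
        for (unsigned n = 0; n < N; ++n) /*SIMD-vectorize the inner loop on each core*/
        {
            table3[n] = table1[n] + table2[n];
        }
    }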
[... It appears that g++ auto-vectorization is turned on by default at `-O3`. This I have tested and verified ...]
[... The best result is obtained by avoiding openmp-simd. The following code uses openmp to split the outer loop across the 4 cores and relies on g++ auto-vectorization.
#pragma omp parallel for
for (int k = 0; k < 500; k++) { /*Do inner loop a lot*/
    for (int n = 0; n < N; ++n) /*VECTORIZE ME*/
    {
        table3[n] = table1[n] + table2[n];
    }
}
Compiling with `g++-7 -O3 -march=native -fopenmp` (thanks to @Marc Glisse for `-march=native`) yields 3912ms. No other combination comes close. ...]