Remark beforehand:
Dividing the loop "by hand" is, I believe, counterproductive (unless you want to understand how OpenMP works). That is why I first suggest you use the more standard approach with a reduction clause. You can always check that it gives the same result in terms of performance.
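As a minimal sketch of what I mean (using the same array ary and length N as your code), the manual partitioning boils down to a single reduction clause:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)  // each thread accumulates into a private copy,
for (long long int i = 0; i < N; i++) {    // OpenMP combines the copies at the end
    sum += ary[i];
}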
Another remark: using the omp_ functions throughout your code means it will no longer compile without the -openmp option.
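One way around this (a sketch, not taken from your code) is to guard those calls with the standard _OPENMP macro so the same file still builds sequentially:
#ifdef _OPENMP
#include <omp.h>
#else
// sequential fallbacks used when building without -openmp
static inline int omp_get_max_threads() { return 1; }
static inline int omp_get_thread_num()  { return 0; }
#endif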
Benching
So I benched with the following code:
Headers
#include <iostream>
#include <fstream>
#include <cstdlib>   // for srand and rand used in main
#include <omp.h>
#include <cmath>
#include <chrono>
#include <iomanip>
Test function with a very simple add operation
void test_simple(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    start = std::chrono::high_resolution_clock::now();
    double local_sum = 0.0;
    #pragma omp parallel
    {
        #pragma omp for reduction(+:local_sum)
        for (long long int i = 0; i < N; i++) {
            local_sum += ary[i];
        }
    }
    sum = local_sum;
    end = std::chrono::high_resolution_clock::now();
    elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
}
Test function with a complex, CPU-intensive operation: atan(sqrt(cos(x)^2 + sin(0.5x)^2))
void test_intensive(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    start = std::chrono::high_resolution_clock::now();
    double local_sum = 0.0;
    #pragma omp parallel
    {
        double c, s;   // private to each thread
        #pragma omp for reduction(+:local_sum)
        for (long long int i = 0; i < N; i++) {
            c = cos(double(ary[i]));
            s = sin(double(ary[i])*0.5);
            local_sum += atan(sqrt(c*c + s*s));
        }
    }
    sum = local_sum;
    end = std::chrono::high_resolution_clock::now();
    elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
}
Main function
using namespace std;
int main() {
    long long int N = 1073741825, i;
    int * ary = new int[N];
    srand(0);
    for (i = 0; i < N; i++) { ary[i] = rand() - RAND_MAX/2; }
    double sum = 0.0;
    long long int elapsed_micro;
    cout << "#" << setw(19) << "N" << setw(20) << "µs" << endl;
    for (i = 128; i < N; i = i*2)
    {
        test_intensive(i, ary, sum, elapsed_micro);
        //test_simple(i, ary, sum, elapsed_micro);
        cout << setw(20) << i << setw(20) << elapsed_micro << setw(20) << sum << endl;
    }
    delete[] ary;
}
Compile (using icpc)
The sequential (no OpenMP) version is compiled with:
icpc test_omp.cpp -O3 --std=c++0x
The OpenMP version is compiled with:
icpc test_omp.cpp -O3 --std=c++0x -openmp
Measurement
Time measurements are done with chrono, using high_resolution_clock. The limit precision on my machine is microseconds, hence the use of std::chrono::microseconds (no point looking for higher precision).
Graph for the simple operation (axes are in log scale!)

Graph for the complex operation (axes are in log scale!)

Conclusions drawn
- There is a one-time overhead the first time OpenMP is used (when the first #pragma omp is crossed) because the thread pool must be set up.
If we take a closer look at the 'intensive' case, the first time we enter the test function (with i=128) the time cost is way higher with OpenMP than without. At the second call (with i=256) we still don't see a benefit from using OpenMP, but the timings are consistent.

- We can also see that there is no scalability with a small number of samples; this is clearer in the simple test case. In other words, the amount of work inside a parallel section must be high enough to make the time needed for thread pool management negligible. Otherwise there is no point in dividing the work among threads.
In this case (with the processor I used) the minimum number of samples is around 100000. But if I were to use 256 threads it would surely be around 6000000.
- However, for more CPU-intensive operations, using OpenMP can induce a speedup even with 1000 samples (with the processor I used).
Summary
- If you benchmark OpenMP code, try to set up the thread pool beforehand with a simple operation inside a #pragma omp parallel region (see the first sketch after this list). In your test case, this setup takes most of the time.
- Using OpenMP only pays off if you parallelize sufficiently CPU-intensive functions (which is not really the case for a simple array sum...). For example, this is the reason why, in nested loops, the #pragma omp for should always be placed on the outermost "possible" loop (see the second sketch after this list).
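A minimal warm-up sketch (my illustration, not part of the benchmark above): an empty parallel region executed once before the timed calls forces the thread pool to be created, so the first measurement is not dominated by that setup cost.
// Warm-up: an empty parallel region spawns the thread pool once,
// so the timed regions below no longer pay the creation cost.
#pragma omp parallel
{
    // intentionally empty
}
// ... timed benchmark calls go here ...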
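And a small sketch of the nested-loop point (again my illustration; the function name scale_matrix and its parameters are hypothetical): the parallel for sits on the outer loop, so each thread gets whole rows of work instead of paying the scheduling cost for every row.
#include <vector>

// Sketch: scale a dense n x m matrix stored row by row.
void scale_matrix(std::vector<double> & a, int n, int m, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {        // the outermost loop is parallelized
        for (int j = 0; j < m; j++) {
            a[(long long)i*m + j] *= factor;
        }
    }
}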