
I am trying to find the sum of the elements in an array, as shown below. However, the OpenMP implementation is surprisingly slower than the sequential implementation. I tried with both heap-allocated and stack-allocated arrays and got similar results. Any help is greatly appreciated.

#include <iostream>
#include <fstream>
#include <ctime>
#include <omp.h>
using namespace std;

int main() {
  int N = 10000;
  int * ary = new int[N];
  ifstream input_file("input.txt"); // data file with N integers (file name assumed, not given in the post)
  for (int i = 0; i < N; i++) { input_file >> ary[i]; }

  // sequential sum
  int sum = 0;
  clock_t begin = clock();
  for (int i = 0; i < N; i++) { sum += ary[i]; }
  clock_t end = clock();
  cout << sum << endl;
  double elapsed_time = double(end - begin) / CLOCKS_PER_SEC;
  cout << elapsed_time << endl;

  // parallel sum: each thread adds up its own contiguous chunk, then the
  // partial sums are combined in a critical section
  sum = 0;
  begin = clock();
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    int total_threads = omp_get_num_threads();
    int elem_per_thread = N / total_threads; // note: the last N % total_threads elements are not summed
    int base = thread_id * elem_per_thread;
    int internal_sum = 0;
    for (int i = base; i < (base + elem_per_thread); i++) {
      internal_sum += ary[i];
    }
    #pragma omp critical
    {
      sum += internal_sum;
    }
  }
  end = clock();
  cout << sum << endl;
  elapsed_time = double(end - begin) / CLOCKS_PER_SEC;
  cout << elapsed_time << endl;

  delete[] ary;
}

The sequential program takes 5e-06 (s) to finish and the parallel one takes 0.001733 (s). I am compiling on Ubuntu 16.04 using g++ -std=c++11 main.cpp -fopenmp -O3 && ./a.out

mmotamedi144

3 Answers


The sequential program optimizes down to doing nothing. This is because the only side effect is the value of sum, and the value of sum is not observable in your program.

With OpenMP, the complexity of spinning off threads prevents the compiler from realizing that you aren't doing anything.

A simple way to avoid this is to add return sum; at the end of main. The sum then shows up as the program's exit code, which is observable, and hence the calculation cannot be optimized away.

Even then, the compiler is still free to never allocate ary, because (with the original ary[i] = i initialization) it can prove that ary[i] == i for all i and replace each read of ary[i] with just i. It can then calculate at compile time that the sum of i from 0 to 9999 is 49995000, eliminate your entire loop, turn it into sum = 49995000, and still take essentially zero time.
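
For illustration, here is a minimal standalone sketch (a reconstruction, not this answer's original code) of the idea: returning or printing sum makes the result observable, although the optimizer may still fold the whole computation into a compile-time constant, as described above.

#include <cstdio>

int main() {
    const int N = 10000;
    int ary[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // ary[i] == i, as in the original code

    int sum = 0;
    for (int i = 0; i < N; i++) sum += ary[i];

    std::printf("%d\n", sum);   // printing is an observable side effect
    return sum & 0xff;          // exit codes are truncated to 8 bits, but the value is still used
}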

463035818_is_not_an_ai
Yakk - Adam Nevraumont
  • Thank you for your help. This is a minimal program. In the original version I print the sum to avoid that, but I still get the same result. – mmotamedi144 Oct 16 '18 at 17:24
  • @mmotamedi144 Paragraph added while you were typing that. Your stuff is so simple that a program could work out the result without doing anything at runtime. To determine if this is happening, examine the assembly of an actual [mcve]. – Yakk - Adam Nevraumont Oct 16 '18 at 17:27
  • @mmotamedi144 Even if you do print the sum, the compiler-generated code might not do what you expect. Compilers are quite clever; for example, a loop like this `int x=0; for (int i=0;i` … – 463035818_is_not_an_ai Oct 16 '18 at 17:28
  • @Yakk-AdamNevraumont Thank you for your post. Well, I changed the code to read the values from an input file, and I print the value of sum and check it, so now there is no way for the compiler to optimize anything away. Still, the sequential version is much faster. – mmotamedi144 Oct 16 '18 at 17:33
  • The program is reading the input from a file. @Yakk-AdamNevraumont – MTMD Oct 16 '18 at 18:18

Remark beforehand:
Dividing the loop "by hand" is, I believe, counterproductive (unless you want to understand how OpenMP works). That's why I first propose the more standard approach using a reduction clause; you can always check that it gives the same result in terms of performance.
Another remark: using omp_ functions throughout your code means it can no longer be compiled without the -openmp option.
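
For reference, here is a minimal sketch (not the asker's exact code) of the same array sum written with a reduction clause instead of manual chunking plus a critical section:

#include <iostream>
#include <omp.h>

int main() {
    const int N = 10000;
    int *ary = new int[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // any initialization will do for the sketch

    int sum = 0;
    // OpenMP splits the iterations across the threads and combines the
    // per-thread partial sums at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += ary[i];
    }

    std::cout << sum << std::endl;
    delete[] ary;
}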

Benchmarking

So I benchmarked with the following code:

headers

#include <iostream>
#include <fstream>
#include <omp.h>
#include <cmath>
#include <chrono>
#include <iomanip>

. test function with a very simple add operation

void test_simple(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
  std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
  start = std::chrono::high_resolution_clock::now();
  double local_sum = 0.0;
  #pragma omp parallel
  {
    // Each thread accumulates into its own copy of local_sum; the copies
    // are added together at the end of the loop by the reduction clause.
    #pragma omp for reduction(+:local_sum)
    for (long long int i = 0; i < N; i++) {
      local_sum += ary[i];
    }
  }
  sum = local_sum;
  end = std::chrono::high_resolution_clock::now();
  elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>
                             (end-start).count();
}

. test function with a more CPU-intensive operation: atan(sqrt(cos(x)^2 + sin(0.5x)^2))

void test_intensive(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
  std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
  start = std::chrono::high_resolution_clock::now();
  double local_sum = 0.0;
  #pragma omp parallel
  {
    double c, s;   // declared inside the parallel region, hence private to each thread
    #pragma omp for reduction(+:local_sum)
    for (long long int i = 0; i < N; i++) {
      c = cos(double(ary[i]));
      s = sin(double(ary[i])*0.5);
      local_sum += atan(sqrt(c*c+s*s));
    }
  }
  sum = local_sum;
  end = std::chrono::high_resolution_clock::now();
  elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>
                             (end-start).count();
}

. Main function

using namespace std;
int main() {
  long long int N = 1073741825, i;
  int * ary = new int[N];
  srand(0);
  for (i = 0; i < N; i++) { ary[i] = rand()-RAND_MAX/2; }
  double sum = 0.0;
  long long int elapsed_micro;
  cout << "#" << setw(19) << "N" << setw(20) << "µs" << setw(20) << "sum" << endl;
  for (i = 128; i < N; i = i*2)
  {
      test_intensive(i, ary, sum, elapsed_micro);
      //test_simple(i, ary, sum, elapsed_micro);
      cout << setw(20) << i << setw(20) << elapsed_micro << setw(20) << sum << endl;
  }
  delete[] ary;
}

Compilation (using icpc)
The sequential (no OpenMP) version is compiled with:

icpc test_omp.cpp -O3 --std=c++0x

The OpenMP version is compiled with:

icpc test_omp.cpp -O3 --std=c++0x -openmp
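
For comparison (my addition, not part of the original answer), the equivalent g++ invocations, matching the compiler used in the question, would presumably be:

g++ test_omp.cpp -O3 -std=c++11

g++ test_omp.cpp -O3 -std=c++11 -fopenmp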

Measurement
Time measurements are done with chrono using high_resolution_clock. The resolution limit on my machine is one microsecond, hence the use of std::chrono::microseconds (there is no point looking for higher precision).
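
As a side note (my addition), one can query the clock's nominal tick period to sanity-check what resolution is even advertised on a given machine:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;
    // Nominal tick period of the clock, expressed as a fraction of a second.
    std::cout << "tick period = " << clock::period::num << "/"
              << clock::period::den << " s" << std::endl;
    std::cout << "steady clock: " << std::boolalpha << clock::is_steady << std::endl;
}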

Graph for the simple operation (axes are in log scale!): [figure: CPU simple]

Graph for the complex operation (axes are in log scale!): [figure: CPU intensive]

Conclusions drawn

  • There is an overhead the first time OpenMP is used (the first #pragma omp encountered) because the thread pool must be set up.
    If we take a closer look at the 'intensive' case, the first time we enter the test_ function (with i=128) the time cost is far higher with OpenMP than without. At the second call (with i=256) we don't yet see the benefit of using OpenMP, but the timings are coherent. [figure: CPU intensive close-up]
  • We do not observe scalability with a small number of samples; this is clearer in the simple test case. In other words, the amount of work inside a parallel section must be high enough to make the time needed for thread-pool management negligible. Otherwise there is no point dividing the work among threads.

  • In this case (with the processor I used) the minimum number of samples is around 100000. But if I were to use 256 threads it would surely be around 6000000.

  • However, for more CPU-intensive operations, OpenMP can bring a speed-up even with 1000 samples (with the processor I used).

Summary

  • If you benchmark OpenMP code, try to set up the thread pool beforehand with a simple #pragma omp parallel region (see the sketch after this list). In your test case this set-up takes most of the time.
  • Using OpenMP only pays off if you parallelize sufficiently CPU-intensive work (which a simple array sum is not...). For example, this is why, in nested loops, the #pragma omp for should always be placed on the outermost loop possible.
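
A minimal sketch of that warm-up idea (my illustration, not the original answer's code): enter an empty parallel region once before starting the timer, so that the thread-pool creation cost is not charged to the measured section.

#include <chrono>
#include <iostream>
#include <omp.h>

int main() {
    const long long N = 100000000;
    double sum = 0.0;

    // Warm-up: the first parallel region pays the thread-pool creation cost.
    #pragma omp parallel
    { }

    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < N; i++) {
        sum += double(i);
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << sum << " computed in "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " µs" << std::endl;
}
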
PilouPili
  • A single critical region outside of the loop can't really explain the timing difference, so I doubt the reduction fixes it. – Max Langhof Oct 16 '18 at 17:32
  • Thank you @MaxLanghof. In addition, this is a minimal version of the program I am actually working on. I need to figure out the problem to continue my work; that is, a reduction is not all I need. – mmotamedi144 Oct 16 '18 at 17:34
  • I have edited my answer, because the first version was just a recommendation. Sorry about that, @MaxLanghof – PilouPili Oct 17 '18 at 16:29

As Max Langhof and user463035818 suggested, the program is memory bound. I changed the program to do more than a plain accumulation: I replaced sum += ary[i] with sum += (pow(ary[i], 1.1) + pow(ary[i], 1.2)) / 100000000.0, made the same change in the parallel version, and measured the time again. The parallel program became 2x faster. If the program is memory bound, I guess there is not much I can do to make it faster with OpenMP. Please let me know if you think otherwise.
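
For completeness, a sketch of what that modified parallel loop might look like when written with a reduction (my reconstruction, assuming non-negative input values, since pow with a negative base and a fractional exponent yields NaN):

#include <cmath>
#include <iostream>
#include <omp.h>

int main() {
    const int N = 10000;
    int *ary = new int[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // placeholder data, non-negative

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        // Heavier per-element arithmetic makes the loop compute bound
        // rather than memory bound.
        sum += (std::pow(ary[i], 1.1) + std::pow(ary[i], 1.2)) / 100000000.0;
    }

    std::cout << sum << std::endl;
    delete[] ary;
}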

mmotamedi144