
I am trying to find the sum of the elements in an array, as shown below. However, the OpenMP implementation is surprisingly slower than the sequential implementation. I tried with both heap-allocated and stack-allocated arrays and got similar results. Any help is greatly appreciated.

#include <iostream>
#include <fstream>
#include <ctime>
#include <omp.h>
using namespace std;

int main() {
  int N = 10000;
  int * ary = new int[N];
  ifstream input_file("input.txt"); // data file with N integers (file name assumed, not given in the post)
  for (int i = 0; i < N; i++) { input_file >> ary[i]; }

  // sequential sum
  int sum = 0;
  clock_t begin = clock();
  for (int i = 0; i < N; i++) { sum += ary[i]; }
  clock_t end = clock();
  cout << sum << endl;
  double elapsed_time = double(end - begin) / CLOCKS_PER_SEC;
  cout << elapsed_time << endl;

  // parallel sum: each thread adds up its own contiguous chunk, then the
  // partial sums are combined in a critical section
  sum = 0;
  begin = clock();
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    int total_threads = omp_get_num_threads();
    int elem_per_thread = N / total_threads; // note: the last N % total_threads elements are not summed
    int base = thread_id * elem_per_thread;
    int internal_sum = 0;
    for (int i = base; i < (base + elem_per_thread); i++) {
      internal_sum += ary[i];
    }
    #pragma omp critical
    {
      sum += internal_sum;
    }
  }
  end = clock();
  cout << sum << endl;
  elapsed_time = double(end - begin) / CLOCKS_PER_SEC;
  cout << elapsed_time << endl;

  delete[] ary;
}

The sequential program takes 5e-06 (s) to finish and the parallel one takes 0.001733 (s). I am compiling on Ubuntu 16.04 using g++ -std=c++11 main.cpp -fopenmp -O3 && ./a.out

mmotamedi144

3 Answers


The sequential program optimizes down to doing nothing. This is because the only side effect is the value of sum, and the value of sum is not observable in your program.

With OpenMP, the complexity of spinning off threads prevents the compiler from realizing that you aren't doing anything.

A simple way to avoid this is to add return sum; at the end of main. The sum then shows up as the program's exit code, which is observable, and hence the calculation cannot be optimized away.

Even then, the compiler is still free to never allocate ary, because (with the original ary[i] = i initialization) it can prove that ary[i] == i for all i and replace each read of ary[i] with just i. It can then calculate at compile time that the sum of i from 0 to 9999 is 49995000, eliminate your entire loop, turn it into sum = 49995000, and still take essentially zero time.
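
For illustration, here is a minimal standalone sketch (a reconstruction, not this answer's original code) of the idea: returning or printing sum makes the result observable, although the optimizer may still fold the whole computation into a compile-time constant, as described above.

#include <cstdio>

int main() {
    const int N = 10000;
    int ary[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // ary[i] == i, as in the original code

    int sum = 0;
    for (int i = 0; i < N; i++) sum += ary[i];

    std::printf("%d\n", sum);   // printing is an observable side effect
    return sum & 0xff;          // exit codes are truncated to 8 bits, but the value is still used
}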

463035818_is_not_an_ai
Yakk - Adam Nevraumont
  • Thank you for your help. This is a minimal program. In the original version I print the sum to avoid that, but I still get the same result. – mmotamedi144 Oct 16 '18 at 17:24
  • @mmotamedi144 Paragraph added while you were typing that. Your stuff is so simple that a program could work out the result without doing anything at runtime. To determine if this is happening, examine the assembly of an actual [mcve]. – Yakk - Adam Nevraumont Oct 16 '18 at 17:27
  • @mmotamedi144 Even if you do print the sum, the compiler-generated code might not do what you expect. Compilers are quite clever; for example, a loop like this `int x=0; for (int i=0;i` … – 463035818_is_not_an_ai Oct 16 '18 at 17:28
  • @Yakk-AdamNevraumont Thank you for your post. Well, I changed the code to read the values from an input file, and I print the value of sum and check it, so now there is no way for the compiler to optimize anything away. Still, the sequential version is much faster. – mmotamedi144 Oct 16 '18 at 17:33
  • The program is reading the input from a file. @Yakk-AdamNevraumont – MTMD Oct 16 '18 at 18:18

Remark beforehand:
Dividing the loop "by hand" is, I believe, counterproductive (unless you want to understand how OpenMP works). That's why I first propose the more standard approach using a reduction clause; you can always check that it gives the same result in terms of performance.
Another remark: using omp_ functions throughout your code means it can no longer be compiled without the -openmp option.
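
For reference, here is a minimal sketch (not the asker's exact code) of the same array sum written with a reduction clause instead of manual chunking plus a critical section:

#include <iostream>
#include <omp.h>

int main() {
    const int N = 10000;
    int *ary = new int[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // any initialization will do for the sketch

    int sum = 0;
    // OpenMP splits the iterations across the threads and combines the
    // per-thread partial sums at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += ary[i];
    }

    std::cout << sum << std::endl;
    delete[] ary;
}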

Benchmarking

So I benchmarked with the following code:

headers

#include <iostream>
#include <fstream>
#include <omp.h>
#include <cmath>
#include <chrono>
#include <iomanip>

. test function with a very simple add operation

void test_simple(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
  std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
  start = std::chrono::high_resolution_clock::now();
  double local_sum = 0.0;
  #pragma omp parallel
  {
    // Each thread accumulates into its own copy of local_sum; the copies
    // are added together at the end of the loop by the reduction clause.
    #pragma omp for reduction(+:local_sum)
    for (long long int i = 0; i < N; i++) {
      local_sum += ary[i];
    }
  }
  sum = local_sum;
  end = std::chrono::high_resolution_clock::now();
  elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>
                             (end-start).count();
}

. test function with a more CPU-intensive operation: atan(sqrt(cos(x)^2 + sin(0.5x)^2))

void test_intensive(long long int N, int * ary, double & sum, long long int & elapsed_micro)
{
  std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
  start = std::chrono::high_resolution_clock::now();
  double local_sum = 0.0;
  #pragma omp parallel
  {
    double c, s;   // declared inside the parallel region, hence private to each thread
    #pragma omp for reduction(+:local_sum)
    for (long long int i = 0; i < N; i++) {
      c = cos(double(ary[i]));
      s = sin(double(ary[i])*0.5);
      local_sum += atan(sqrt(c*c+s*s));
    }
  }
  sum = local_sum;
  end = std::chrono::high_resolution_clock::now();
  elapsed_micro = std::chrono::duration_cast<std::chrono::microseconds>
                             (end-start).count();
}

. Main function

using namespace std;
int main() {
  long long int N = 1073741825, i;
  int * ary = new int[N];
  srand(0);
  for (i = 0; i < N; i++) { ary[i] = rand()-RAND_MAX/2; }
  double sum = 0.0;
  long long int elapsed_micro;
  cout << "#" << setw(19) << "N" << setw(20) << "µs" << setw(20) << "sum" << endl;
  for (i = 128; i < N; i = i*2)
  {
      test_intensive(i, ary, sum, elapsed_micro);
      //test_simple(i, ary, sum, elapsed_micro);
      cout << setw(20) << i << setw(20) << elapsed_micro << setw(20) << sum << endl;
  }
  delete[] ary;
}

Compilation (using icpc)
The sequential (no OpenMP) version is compiled with:

icpc test_omp.cpp -O3 --std=c++0x

The OpenMP version is compiled with:

icpc test_omp.cpp -O3 --std=c++0x -openmp
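
For comparison (my addition, not part of the original answer), the equivalent g++ invocations, matching the compiler used in the question, would presumably be:

g++ test_omp.cpp -O3 -std=c++11

g++ test_omp.cpp -O3 -std=c++11 -fopenmp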

Measurement
Time measurements are done with chrono using high_resolution_clock. The resolution limit on my machine is one microsecond, hence the use of std::chrono::microseconds (there is no point looking for higher precision).
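
As a side note (my addition), one can query the clock's nominal tick period to sanity-check what resolution is even advertised on a given machine:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;
    // Nominal tick period of the clock, expressed as a fraction of a second.
    std::cout << "tick period = " << clock::period::num << "/"
              << clock::period::den << " s" << std::endl;
    std::cout << "steady clock: " << std::boolalpha << clock::is_steady << std::endl;
}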

Graph for the simple operation (axes are in log scale!): [figure: CPU simple]

Graph for the complex operation (axes are in log scale!): [figure: CPU intensive]

Conclusions drawn

  • There is an overhead the first time OpenMP is used (the first #pragma omp encountered) because the thread pool must be set up.
    If we take a closer look at the 'intensive' case, the first time we enter the test_ function (with i=128) the time cost is far higher with OpenMP than without. At the second call (with i=256) we don't yet see the benefit of using OpenMP, but the timings are coherent. [figure: CPU intensive close-up]
  • We do not observe scalability with a small number of samples; this is clearer in the simple test case. In other words, the amount of work inside a parallel section must be high enough to make the time needed for thread-pool management negligible. Otherwise there is no point dividing the work among threads.

  • In this case (with the processor I used) the minimum number of samples is around 100000. But if I were to use 256 threads it would surely be around 6000000.

  • However, for more CPU-intensive operations, OpenMP can bring a speed-up even with 1000 samples (with the processor I used).

Summary

  • If you benchmark OpenMP code, try to set up the thread pool beforehand with a simple #pragma omp parallel region (see the sketch after this list). In your test case this set-up takes most of the time.
  • Using OpenMP only pays off if you parallelize sufficiently CPU-intensive work (which a simple array sum is not...). For example, this is why, in nested loops, the #pragma omp for should always be placed on the outermost loop possible.
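
A minimal sketch of that warm-up idea (my illustration, not the original answer's code): enter an empty parallel region once before starting the timer, so that the thread-pool creation cost is not charged to the measured section.

#include <chrono>
#include <iostream>
#include <omp.h>

int main() {
    const long long N = 100000000;
    double sum = 0.0;

    // Warm-up: the first parallel region pays the thread-pool creation cost.
    #pragma omp parallel
    { }

    auto start = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < N; i++) {
        sum += double(i);
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << sum << " computed in "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " µs" << std::endl;
}
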
PilouPili
  • A single critical region outside of the loop can't really explain the timing difference, so I doubt the reduction fixes it. – Max Langhof Oct 16 '18 at 17:32
  • Thank you @MaxLanghof. In addition, this is a minimal version of the program I am actually working on. I need to figure out the problem to continue my work; that is, a reduction is not all I need. – mmotamedi144 Oct 16 '18 at 17:34
  • I have edited my answer, because the first version was just a recommendation. Sorry about that, @MaxLanghof – PilouPili Oct 17 '18 at 16:29

As Max Langhof and user463035818 suggested, the program is memory bound. I changed the program to do more than a plain accumulation: I replaced sum += ary[i] with sum += (pow(ary[i], 1.1) + pow(ary[i], 1.2)) / 100000000.0, made the same change in the parallel version, and measured the time again. The parallel program became 2x faster. If the program is memory bound, I guess there is not much I can do to make it faster with OpenMP. Please let me know if you think otherwise.
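
For completeness, a sketch of what that modified parallel loop might look like when written with a reduction (my reconstruction, assuming non-negative input values, since pow with a negative base and a fractional exponent yields NaN):

#include <cmath>
#include <iostream>
#include <omp.h>

int main() {
    const int N = 10000;
    int *ary = new int[N];
    for (int i = 0; i < N; i++) ary[i] = i;   // placeholder data, non-negative

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        // Heavier per-element arithmetic makes the loop compute bound
        // rather than memory bound.
        sum += (std::pow(ary[i], 1.1) + std::pow(ary[i], 1.2)) / 100000000.0;
    }

    std::cout << sum << std::endl;
    delete[] ary;
}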

mmotamedi144