
I am working with OpenMP, trying to obtain an algorithm with a near-linear speedup. Unfortunately, I noticed that I could not get the desired speedup.

So, in order to understand the error in my code, I wrote a second, simpler program just to double-check that the speedup was in principle obtainable on my hardware.

This is the toy example I wrote:

#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main () {
    int number_of_threads = 1;
    int n = 600;
    int m = 50;
    int N = n / number_of_threads;
    int time_limit = 600;
    double total_clock = omp_get_wtime();
    int time_flag = 0;

    #pragma omp parallel num_threads(number_of_threads)
    {
        int thread_id = omp_get_thread_num();
        int iteration_number_local = 0;
        double *C = new double[n]; std::fill(C, C+n, 3.0);
        double *D = new double[n]; std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        while (time_flag == 0) {
            for (int i = 0; i < N; i++)
                for (int z = 0; z < m; z++)
                    for (int x = 0; x < n; x++)
                        for (int c = 0; c < n; c++) {
                            CD[c] = C[z]*D[x];
                            C[z] = CD[c] + D[x];
                        }
            iteration_number_local++;
            if ((omp_get_wtime() - total_clock) >= time_limit)
                time_flag = 1;
        }
        #pragma omp critical
        std::cout << "I am " << thread_id << " and I got " << iteration_number_local << " iterations." << std::endl;
    }
}

I want to highlight again that this code is only a toy example written to observe the speedup: the first for loop becomes shorter when the number of parallel threads increases (since N decreases).

However, when I go from 1 to 2-4 threads the number of iterations doubles as expected; but this is not the case when I use 8, 10, or 20 threads: the number of iterations does not increase linearly with the number of threads.

Could you please help me with this? Is the code correct? Should I expect a near-linear speedup?

Results

Running the code above I got the following results.

1 thread: 23 iterations.

20 threads: 397-401 iterations per thread (instead of 420-460).

  • What hardware are you running on? Please be specific about the processor(s) and memory. What compiler version and options and what operating system? How many iterations do you observe? – Zulan Jul 28 '16 at 10:30
  • Problematic issues in your measurement: `CD` is never used so the compiler could just optimize everything you expect to be expensive away. You should at least output all `iteration_number_local` (use `pragma omp critical`). – Zulan Jul 28 '16 at 10:32
  • I am running the code on hardware with two 10-core Intel Xeon E5 processors (so I have 20 cores in total) and 256 GB of RAM. The operating system is Linux. I do not know about the compiler: I load a module called "gsl 1.15", and cmake calls a compiler called "icc". I am not sure this is what you asked; please clarify. I ran some quick simulations with n = 1000, m = 200. With 1 thread I get 3 iterations in 120 seconds. With 2 threads I get 5 iterations per thread (instead of 6). With 20 threads I get between 40 and 44 iterations per thread (instead of 60!). – Mobius88 Jul 28 '16 at 11:16

3 Answers


Your measurement methodology is wrong, especially for a small number of iterations.

1 thread: 3 iterations.

3 reported iterations actually means that 2 iterations finished in less than 120 s. The third one took longer. The time of 1 iteration is between 40 and 60 s.

2 threads: 5 iterations per thread (instead of 6).

4 iterations finished in less than 120 s. The time of 1 iteration is between 24 and 30 s.

20 threads: 40-44 iterations per thread (instead of 60).

40 iterations finished in less than 120 s. The time of 1 iteration is between 2.9 and 3 s.

As you can see, your results actually do not contradict a linear speedup.
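
In general, if a thread reports k completed iterations under a time limit T, then k-1 iterations finished in less than T and the k-th pushed the total to at least T, so for the per-iteration time t:

(k - 1) * t < T <= k * t   =>   T/k <= t < T/(k - 1)

e.g. T = 120 s, k = 3  ->  40 s <= t < 60 s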

It would be much simpler and more accurate to execute and time one single outer loop instead, and you will likely see an almost perfect linear speedup.
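
For example, something along these lines (a minimal sketch reusing the kernel and sizes from the question; the exact numbers are only illustrative):

#include <omp.h>
#include <algorithm>
#include <iostream>

int main() {
    int number_of_threads = omp_get_max_threads();
    int n = 600;
    int m = 50;
    int N = n / number_of_threads;   // each thread gets a 1/threads share, as in the question

    double start = omp_get_wtime();
    #pragma omp parallel num_threads(number_of_threads)
    {
        double *C = new double[n]; std::fill(C, C+n, 3.0);
        double *D = new double[n]; std::fill(D, D+n, 3.0);
        double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

        // One single pass over the outer loop, no while/time-limit logic.
        for (int i = 0; i < N; i++)
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++) {
                        CD[c] = C[z]*D[x];
                        C[z] = CD[c] + D[x];
                    }

        // Print a result so the compiler cannot optimize the work away.
        #pragma omp critical
        std::cout << "thread " << omp_get_thread_num() << " CD[0] = " << CD[0] << std::endl;

        delete[] C; delete[] D; delete[] CD;
    }
    double elapsed = omp_get_wtime() - start;

    // With a near-linear speedup, 'elapsed' should shrink roughly as
    // 1/number_of_threads compared to the single-threaded run.
    std::cout << number_of_threads << " threads: " << elapsed << " s" << std::endl;
}

With one wall-time measurement per thread count you avoid the rounding effect of counting whole iterations within a fixed time window.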

Some reasons (non-exhaustive) why you might not see a linear speedup are:

  1. Memory-bound performance. Not the case in your toy example with n = 1000. More generally: contention for a shared resource (main memory, caches, I/O).
  2. Synchronization between threads (e.g. critical sections). Not the case in your toy example.
  3. Load imbalance between threads. Not the case in your toy example.
  4. Turbo mode will use lower frequencies when all cores are utilized. This can happen in your toy example.

From your toy example I would say that your approach to OpenMP can be improved by making better use of the high-level abstractions, e.g. #pragma omp for.
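
For instance, a minimal sketch of what that could look like (the kernel is simplified so that each outer iteration only writes its own slot of a result array):

#include <omp.h>
#include <algorithm>
#include <iostream>

int main() {
    const int n = 600, m = 50;
    double *C = new double[n];      std::fill(C, C + n, 3.0);
    double *D = new double[n];      std::fill(D, D + n, 3.0);
    double *result = new double[n]; std::fill(result, result + n, 0.0);

    double start = omp_get_wtime();

    // The runtime splits the n outer iterations across the available
    // threads; no manual division of the loop bound is needed.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double acc = 0.0;
        for (int z = 0; z < m; z++)
            for (int x = 0; x < n; x++)
                acc += C[z] * D[x];
        result[i] = acc;             // each iteration writes only its own slot
    }

    double elapsed = omp_get_wtime() - start;
    std::cout << "elapsed: " << elapsed << " s, result[0] = " << result[0] << std::endl;

    delete[] C; delete[] D; delete[] result;
}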

More general advice would be too broad for this format and would require more specific information about the non-toy example.

Zulan
  • I agree with your answer, so I ran longer simulations. I ran a code like the one above, including another instruction inside the nested loop, C[z] = CD[c] + D[x];, in order to use CD, as you suggested. Setting n=600 and m=50 I got 23 iterations with one thread in 600 seconds and about 400 iterations per thread with 20 threads in 600 seconds. That is not the expected speedup. Am I right? – Mobius88 Jul 28 '16 at 13:59
  • 399 / 22 is pretty close to 20 x speedup. Close enough to be perfectly acceptable as a near-linear speedup in a real application. Also easily explainable by turbo mode or even just variance. – Zulan Jul 28 '16 at 14:17
  • As the arrays are small enough to fit into the L1 cache, it is probably the frequency scaling. – Hristo Iliev Jul 29 '16 at 09:01

You make some declarations inside the parallel region, which means the memory will be allocated and filled number_of_threads times. Instead, I recommend:

double *C = new double[n]; std::fill(C, C+n, 3.0);
double *D = new double[n]; std::fill(D, D+n, 3.0);
double *CD = new double[n]; std::fill(CD, CD+n, 0.0);
#pragma omp parallel firstprivate(C,D,CD) num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
}

Your hardware has a limited number of threads, which depends on the number of cores of your processor. You may have only 2 or 4 cores.

A parallel region by itself doesn't speed up your code. With OpenMP you should use #pragma omp parallel for to speed up a for loop, or

#pragma omp parallel
{
  #pragma omp for
  {
  }
}

This notation is equivalent to #pragma omp parallel for. It will use several threads (depending on your hardware) to process the for loop faster. Be careful:

#pragma omp parallel
{
  for
  {
  }
}

will run the entire for loop on each thread, which will not speed up your program.
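
A minimal sketch of the difference (the loop body only prints which thread handles which index, so the two behaviours are easy to compare):

#include <omp.h>
#include <iostream>

int main() {
    const int n = 8;

    // Work-sharing: the n iterations are split among the threads,
    // so each index is printed exactly once.
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            std::cout << "shared     iteration " << i
                      << " on thread " << omp_get_thread_num() << std::endl;
        }
    }

    // No work-sharing: every thread executes the whole loop,
    // so each index is printed once per thread.
    #pragma omp parallel
    {
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            std::cout << "replicated iteration " << i
                      << " on thread " << omp_get_thread_num() << std::endl;
        }
    }
}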

  • I am not sure you are right. The construct #pragma omp parallel lets the number of threads I request execute all the commands inside the block independently. So, each thread will execute the nested for loops inside the parallel block. You can see that the first for loop becomes shorter when the number of threads increases, so there should be a linear speedup. I am running this code on a cluster computer with 20 cores per node. – Mobius88 Jul 28 '16 at 09:53
  • Of course the first for loop becomes shorter, but you will do it number_of_threads times. So in the end, you will perform n/number_of_threads*number_of_threads operations. – Pierre Guilbert Jul 28 '16 at 10:00
  • #pragma omp parallel { #pragma omp for { } } and #pragma omp parallel { for { } } are not the same instruction – Pierre Guilbert Jul 28 '16 at 10:00
  • Maybe I am missing something. Example: if I run the code with 1 thread only, it will execute the first for loop, let's say, 10 times in 120 seconds. Now if I run the same code with 4 threads, I would expect each thread to run the same code where the for loop is 4 times shorter (because N is divided by 4). So I will have 4 threads running the same block, where the for loop is 4 times shorter than in the 1-thread situation. So each thread will execute the first for loop 40 times in 120 seconds. – Mobius88 Jul 28 '16 at 10:02
  • I got it now and I agree. Since the threads are running a for loop that is 4 times shorter, you should see the speedup. Maybe some cores of the cluster are very slow and ruined your speedup. But I think you should try a simpler example with just one #pragma omp parallel instruction and see if you get a linear speedup. – Pierre Guilbert Jul 28 '16 at 10:09

You should try

int iteration_number = 0;   // shared result, declared before the parallel region

#pragma omp parallel num_threads(number_of_threads)
{
    int thread_id = omp_get_thread_num();
    int iteration_number_local = 0;
    double *C = new double[n]; std::fill(C, C+n, 3.0);
    double *D = new double[n]; std::fill(D, D+n, 3.0);
    double *CD = new double[n]; std::fill(CD, CD+n, 0.0);

    while (time_flag == 0) {
        #pragma omp for
        for (int i = 0; i < N; i++)
            for (int z = 0; z < m; z++)
                for (int x = 0; x < n; x++)
                    for (int c = 0; c < n; c++)
                        CD[c] = C[z]*D[x];
        iteration_number_local++;
        if ((omp_get_wtime() - total_clock) >= time_limit)
            time_flag = 1;
    }
    if (thread_id == 0)
        iteration_number = iteration_number_local;
}
std::cout << "Iterations = " << iteration_number << std::endl;
}