
I have implemented the knapsack problem using OpenMP (gcc version 4.6.3):

#define MAX(x,y)   ((x)>(y) ? (x) : (y))
/* 2-D DP table with rows of C+1 entries, stored as a flat 1-D array */
#define table(i,j)    table[(i)*(C+1)+(j)]

   /* fill the table row by row; parallelise over the capacities */
   for(i=1; i<=N; ++i) {
#pragma omp parallel for
      for(j=1; j<=C; ++j) {
         if(weights[i]>j) {
            table(i,j) = table(i-1,j);   /* item i does not fit into capacity j */
         } else {
            table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
         }
      }
   }

Execution time for the sequential program: 1 s

Execution time for OpenMP with 1 thread: 1.7 s (the extra 0.7 s is roughly 40% of the OpenMP run time)

I used the same compiler optimization flag (-O3) in both cases.

Can someone explain the reason behind this behavior?

Thanks.

warunapww
  • Might very well be OpenMP overhead. Actually, parallel code can never be faster than sequential code when using a single processor. BTW, are you sure your measurements are correct and relevant to what you are trying to do? – Jonas Bötel Feb 25 '14 at 05:06
  • 1 s is actually a long time, so my first guess is that the overhead would be insignificant. What compiler options did you use? Did you tell the compiler to optimize the code (e.g. with -O3, -O2, or -Ofast)? – Z boson Feb 25 '14 at 08:22
  • I even tried runtimes of 15 s and I still see 3 s of overhead. I was using -O3 for both. – warunapww Feb 25 '14 at 20:17

1 Answer


Enabling OpenMP inhibits certain compiler optimisations, e.g. it could prevent loops from being vectorised or shared variables from being kept in registers. Therefore OpenMP-enabled code is usually slower than the serial version, and one has to utilise the available parallelism to offset this.

That being said, your code contains a parallel region nested inside the outer loop. This means that the overhead of entering and exiting the parallel region is paid N times. This only makes sense if N is relatively small and C is significantly larger (like orders of magnitude larger) than N, so that the work done inside the region greatly outweighs the OpenMP overhead.
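As a rough illustration (a sketch, not part of the original answer, reusing the `table(i,j)` and `MAX` macros and the variables from the question), the parallel region can be created once outside the outer loop and only the inner loop work-shared:

#pragma omp parallel       /* the thread team is created only once */
{
   int i, j;               /* private to each thread */
   for(i=1; i<=N; ++i) {
      /* the implicit barrier at the end of the for construct guarantees
         that row i-1 is fully written before any thread reads it for row i */
#pragma omp for
      for(j=1; j<=C; ++j) {
         if(weights[i]>j) {
            table(i,j) = table(i-1,j);
         } else {
            table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
         }
      }
   }
}

With this structure the fork/join cost is paid once for the whole computation instead of once per row; what remains per row is the `omp for` scheduling plus one barrier.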

Hristo Iliev
  • Exactly, those are my thoughts as well. Assuming N is large enough, the overhead should be insignificant. I would also be surprised if compiler optimizations made so much of a difference (1 s vs 1.7 s). That's why I'm wondering if optimization was even used, and if it was, it would be interesting to compare the assembly output with and without OpenMP. – Z boson Feb 25 '14 at 12:03
  • 1
    @Zboson, `i` is a shared variable in the parallel region. GCC converts any instance of `i` in the region into `omp_arg->i` where `omp_arg` is a structure that contains copies of all shared variables in the region and a pointer to it gets passed as an argument to the function that implements the region. When compiled as serial, `i` is most likely kept in a register the whole time. Therefore making `i` `firstprivate` should improve the performance to a certain extent. – Hristo Iliev Feb 25 '14 at 13:15
  • Oh, I see your point. I thought that since `i` is not changed in the parallel region it would not matter whether it was declared private or shared, but I see now that it might affect optimization. – Z boson Feb 25 '14 at 13:32
  • 1
    One can use `-O3 -fdump-tree-all` and examine the `.c.126t.final_cleanup` file and see how different the code is. In the serial version `i` is eliminated altogether while in the shared version `i` is reloaded every time. – Hristo Iliev Feb 25 '14 at 17:18
  • @Hristo N = 960, C = 6680000. For this case the sequential time is 16 s and OpenMP with 1 thread is 19 s, so the overhead is about 20%. I'd like to know whether this is normal. Thanks – warunapww Feb 25 '14 at 22:06
  • Add the `firstprivate(i,C)` clause and see how it changes the timings. – Hristo Iliev Feb 25 '14 at 22:31
  • @warunapww, how much did `firstprivate(i,C)` improve the performance? – Z boson Feb 26 '14 at 08:06
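For reference, the `firstprivate` change suggested in the comments above would look something like this (a sketch against the question's loop, assuming `C` is an ordinary variable as the comments imply; untested):

   for(i=1; i<=N; ++i) {
      /* give each thread its own copy of i and C so they can stay in
         registers instead of being re-read from the shared-variables struct */
#pragma omp parallel for firstprivate(i,C)
      for(j=1; j<=C; ++j) {
         if(weights[i]>j) {
            table(i,j) = table(i-1,j);
         } else {
            table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
         }
      }
   }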