
I am testing speed-ups for a parallel program in C using OpenMP. When I compile the code with gcc and the -O3 flag, the execution time is much smaller. However, I consistently get lower speed-ups for different thread counts (2, 4, 8, 16, 24) than with the code compiled without optimization flags. How is this possible?

Here is more info about what I've found so far. I am writing code to find prime numbers based on the Sieve of Eratosthenes, and I am trying to speed it up with a parallel version using OpenMP. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h> 
#include <math.h> 

// ind2num: returns the integer (3<=odd<=numMax)
//      represented by index i at prime_numbers (0<=i<=maxInd)
#define ind2num(i)  (2*(i)+3)
// num2ind: returns the index (0<=i<=maxInd) at prime_numbers
//      which represents the number (3<=odd<=numMax)
#define num2ind(i)  (((i)-3)/2)

// Sieve: find all prime numbers until ind2num(maxInd)
void Sieve(int *prime_numbers, long maxInd) {
    long maxSqrt;
    long baseInd;
    long base;
    long i;

    // square root of the largest integer (largest possible prime factor)
    maxSqrt = (long) sqrt((long) ind2num(maxInd));

    // first base
    baseInd=0;
    base=3;

    do {
        // marks as non-prime all multiples of base starting at base^2
        #pragma omp parallel for schedule (static)
        for (i=num2ind(base*base); i<=maxInd; i+=base) {
            prime_numbers[i]=0;
        }

        // updates base to next prime number
        for (baseInd=baseInd+1; baseInd<=maxInd; baseInd++)
        if (prime_numbers[baseInd]) {
                base = ind2num(baseInd);
                break;
            }
    }
    while (baseInd <= maxInd && base <= maxSqrt);

}
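
I haven't shown the driver that allocates the array and measures the time. A minimal version of the kind of harness I mean (a simplified sketch, not my exact benchmark code; it uses omp_get_wtime for illustration and reuses the includes and macros above) would be:

// Simplified timing sketch: allocate the odd-only sieve, mark every
// entry as prime, time the call to Sieve() and print the elapsed time.
int main(void) {
    long numMax = 1000000000L;              // search for primes below 10^9
    long maxInd = num2ind(numMax - 1);      // index of the largest odd number below numMax
    int *prime_numbers = malloc((maxInd + 1) * sizeof(int));
    if (prime_numbers == NULL) return 1;

    for (long i = 0; i <= maxInd; i++)
        prime_numbers[i] = 1;               // 1 = "still considered prime"

    double start = omp_get_wtime();
    Sieve(prime_numbers, maxInd);
    double end = omp_get_wtime();

    printf("elapsed: %.2fs with %d threads\n", end - start, omp_get_max_threads());
    free(prime_numbers);
    return 0;
}

Each configuration would then be run as, e.g., OMP_NUM_THREADS=8 ./sieve after compiling with gcc -fopenmp (with or without -O3).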

If I execute it to find all prime numbers smaller than 1000000000 (10^9), for example, I end up with the following execution times for different numbers of threads (1, 2, 4, 8, 16, 24):

| Threads     | 1      | 2      | 4      | 8      | 16    | 24    |
|-------------|--------|--------|--------|--------|-------|-------|
| Without -O3 | 56.31s | 28.87s | 21.77s | 11.19s | 6.13s | 4.50s |
| With -O3    | 10.10s | 5.23s  | 3.74s  | 2.81s  | 2.62s | 2.52s |

Here are the corresponding speed-ups:

| Threads     | 1 | 2    | 4    | 8    | 16   | 24    |
|-------------|---|------|------|------|------|-------|
| Without -O3 | 1 | 1.95 | 2.59 | 5.03 | 9.19 | 12.51 |
| With -O3    | 1 | 1.93 | 2.70 | 3.59 | 3.85 | 4.01  |

How come I keep getting lower speed-ups with the -O3 flag?

Gabriel Ilharco
  • @user3386109, he's parallelizing the process of striking out multiples of a given prime from the sieve (look for the `#pragma omp`). – John Bollinger Aug 19 '16 at 17:53
  • How are you calculating the times and what do the numbers represent (e.g. diff of `clock_gettime`)? It looks like (e.g.) for 24, without `-O3` is 4.5s and with is 2.52s (i.e. _with_ is ~2x faster). Have you verified that _results_ are the same for all # of threads? That is, no race condition in modifying `prime_numbers`. Also, beyond ~4 threads, you could saturate the memory bus with accesses to `prime_numbers` and that's the limiting factor, regardless of number of cores. – Craig Estey Aug 19 '16 at 17:53
  • For every number of threads, the run time of the `-O3` optimized code is lower than the runtime of the default-optimized code. That's what you should be looking at. There is no reason to assume that the speedup with increasing threadcount should be the same for one case as for the other -- you're comparing the speedups of *different programs*. – John Bollinger Aug 19 '16 at 17:57
  • @JohnBollinger Yes, you are correct. Seems similar to [this question](https://stackoverflow.com/questions/25624714/openmp-embarrassingly-parallel-for-loop-no-speedup). – user3386109 Aug 19 '16 at 17:58
  • @user3386109, yes, there is some similarity, although in this case the parallelization does deliver some speedup, especially for small numbers of threads. But the question here is quite different: not about the speedup with threadcount, but about the differential effect of various optimization levels on the speedup. – John Bollinger Aug 19 '16 at 18:04
  • @JohnBollinger I'm surprised that there's any speedup at all. The loop that's being parallelized consists of four instructions: store, add, compare, branch. How does that not saturate the memory interface with 4 threads? Answer: the OP's machine isn't just a standard 4-core desktop. Which is to say that the OP needs to provide information about the machine before the results can be analyzed. – user3386109 Aug 19 '16 at 18:15
  • @user3386109: Nice catch! I'm running on an Intel Xeon 2695 with 12 cores / 24 threads. – Gabriel Ilharco Aug 19 '16 at 18:23
  • @JohnBollinger: It seems to me that -O3 is somehow making my code less amenable to parallel execution. What does -O3 do that could possibly affect that? Isn't -O3 supposed to be just unrolling the loop? – Gabriel Ilharco Aug 19 '16 at 18:25
  • @GabrielIlharco, details of the optimizations performed at each level vary with the version of GCC, but I think it's safe to say that `-O3` adds many more optimizations than just loop unrolling to those performed by the default level (`-O` == `-O1`) in every version available. For that matter, `-O3` does not include loop unrolling at all in my version of GCC -- you need to request it explicitly if you want it. – John Bollinger Aug 19 '16 at 18:31
  • @GabrielIlharco, but that's beside the point. I don't see why it's hard to believe that a more highly-optimized program affords less opportunity for parallelism to be effective. It may be that the balance of the serial vs. parallelizable parts is changed, or that the effect of a system bottleneck -- such as the memory bus -- is enhanced. – John Bollinger Aug 19 '16 at 18:37
  • @JohnBollinger: It does make a lot of sense. Assuming unrolling the loops (or other optimizations) makes the parallelizable part of the code faster, it shouldn't be a surprise if the parallelizable part of the code represents, after optimizing, a smaller fraction of execution time. Thus, by Amdahl's law, it makes sense that improving the execution time of that part of the code by parallelization should have a smaller effect on total execution time. – Gabriel Ilharco Aug 19 '16 at 19:02
  • You should compile with `-O0` and downclock the CPU cores if you'd like to get the best scaling possible. – Hristo Iliev Aug 19 '16 at 20:14
  • As Hristo said the slowest configuration normally gives best parallel scaling – tim18 Aug 20 '16 at 02:17

1 Answer


There's a certain amount of memory bandwidth that the execution of the algorithm is going to require. The less optimized the code, the more internal CPU machinations dominate the run time. The more optimized the code, the more memory speed dominates the run time.

Since unoptimized code is less efficient, more cores can run it before the system memory bandwidth gets saturated. Since the optimized code is more efficient, it gets its memory accesses done faster and thus puts a heavier load on the system memory bandwidth. This makes it less parallelizable.
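
One way to see this effect in isolation is to time a loop that does almost nothing but store to memory -- essentially the same store/add/compare/branch pattern as the sieve's marking loop. Below is a rough, self-contained sketch (not the asker's program; the ~1 GiB array size is an arbitrary choice):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

// Purely memory-bound kernel: one store per iteration and almost no
// arithmetic. Once a few threads saturate the memory bus, adding more
// threads barely changes the elapsed time.
int main(void) {
    long n = 1L << 28;                      // ~268M ints, about 1 GiB
    int *a = malloc(n * sizeof(int));
    if (a == NULL) return 1;

    double start = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0;                           // same kind of work as the sieve's marking loop
    double end = omp_get_wtime();

    printf("%d threads: %.3fs\n", omp_get_max_threads(), end - start);
    free(a);
    return 0;
}

Running it with different OMP_NUM_THREADS values, compiled with and without -O3, would be expected to show the same pattern as the tables in the question: the unoptimized build starts slower and therefore keeps scaling to higher thread counts, while the optimized build hits the bandwidth ceiling with fewer threads.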

David Schwartz
  • I would rather say, in that particular case the scalability is limited by the OpenMP overhead and the fact that only a portion of the code is parallel (Amdahl's law). Compiling without optimisations increases the parallel time, thus reducing the percentage of the overhead+serial part (see the back-of-the-envelope sketch below). – Hristo Iliev Aug 19 '16 at 20:19
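
To put rough numbers on that: Amdahl's law says the speedup on n threads is S = 1 / ((1 - p) + p/n), where p is the fraction of the run time that is parallelized. Solving for p from the measured 24-thread speedups in the question gives roughly p ≈ 0.96 without -O3 but only p ≈ 0.78 with -O3. This is a back-of-the-envelope estimate that ignores OpenMP overhead and memory effects, but it illustrates how optimization shrinks the parallel fraction. A tiny helper to reproduce the arithmetic:

#include <stdio.h>

// Amdahl's law: S = 1 / ((1 - p) + p/n).
// Solving for the parallel fraction p given a measured speedup S on n threads:
//   p = (1 - 1/S) / (1 - 1/n)
static double parallel_fraction(double S, int n) {
    return (1.0 - 1.0 / S) / (1.0 - 1.0 / (double) n);
}

int main(void) {
    // 24-thread speedups taken from the question's tables.
    printf("without -O3: p = %.2f\n", parallel_fraction(12.51, 24));   // ~0.96
    printf("with -O3:    p = %.2f\n", parallel_fraction(4.01, 24));    // ~0.78
    return 0;
}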