7

The other day I ran into a weird problem using GCC and the '-Ofast' optimization flag. I compiled the program below using 'gcc -Ofast -o fib1 fib1.c':

#include <stdio.h>

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}

int main(){
    printf("%d", f1(40));
}

When measuring execution time, the result is:

peter@host ~ $ time ./fib1
102334155
real    0m0.511s
user    0m0.510s
sys     0m0.000s

Now let's introduce a global variable in our program and compile again using 'gcc -Ofast -o fib2 fib2.c'.

#include <stdio.h>

int global;

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);

    global = 0;

    return a + b;
}

int main(){
    printf("%d", f1(40));
}

Now the execution time is:

peter@host ~ $ time ./fib2
102334155
real    0m0.265s
user    0m0.265s
sys     0m0.000s

The new global variable does not do anything meaningful. However, the difference in execution time is considerable.

Apart from the question of (1) what the reason for this behavior is, it would also be nice to know whether (2) the latter (faster) performance can be achieved without introducing meaningless variables. Any suggestions?

Thanks, Peter

  • First thing I would suspect is fishy benchmarking methods. – Lundin Dec 23 '15 at 13:09
  • Have you tried less aggressive optimization levels, like `-O3` or `-O2`? – Ilya Popov Dec 23 '15 at 13:09
  • One measurement is not enough for forming a statistic – StoryTeller - Unslander Monica Dec 23 '15 at 13:09
  • I think you need to be more rigorous with your benchmarking. A still-back-of-the-envelope but easy improvement would be to modify the code so that it repeatedly executes the measured code in a loop, say a million times, and compare THOSE timings. Also get rid of the printf statement. – Ken Clement Dec 23 '15 at 13:13
  • In addition to what has already been said, you can inspect the generated assembly code. – Jabberwocky Dec 23 '15 at 13:17
  • @KenClement: Please do wait a half a million seconds, and tell us the results. – Karoly Horvath Dec 23 '15 at 13:18
  • I confirmed the difference on my machine using GCC 4.9. The generated assembly code is completely different (I get 359 lines of assembly without the global, and 676 lines with it) so it's hard to tell what exactly changed here. – interjay Dec 23 '15 at 13:21
  • I suspect it's because when you're compiling your code, there's no way the compiler at that time knows that the global is never used. That would be optimized away at link time. What's the result if you use `-flto`? (I'm no expert in this, so that might not be complete) See https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html – Andrew Henle Dec 23 '15 at 13:21
  • With `-O2`, the results switch places. – David Schwartz Dec 23 '15 at 13:24
  • @Karoly Horvath, what exactly is your point above? – Ken Clement Dec 23 '15 at 13:33
  • Multiple runs show similar results. So, for 'fib1' I can see an average of 0.5 seconds while multiple runs of 'fib2' show an average of 0.26 seconds. Overall, the second program is twice as fast as the first. – Peter Dec 23 '15 at 13:35
  • The results are very fragile. Changing tiny little things changes them drastically. For example, change the order of the `a =` and `b =` lines and the effect vanishes. – David Schwartz Dec 23 '15 at 13:38
  • @KenClement His point is that running code that takes half a second to run a million times is not feasible. The code already essentially runs repeatedly due to being recursive. – interjay Dec 23 '15 at 13:38
  • @interjay, The reason for removing the I/O and running the function a large number of times with a fixed argument is to remove, or at least greatly reduce, sources of random noise and bias in the timings that may not be obvious going in. Relying only on the recursion to achieve this introduces other possible biases in the result, which are eliminated by a brute-force execution of the function itself. Vary the argument to the function as necessary (I used 10). Sorry, I assumed this was self-evident. – Ken Clement Dec 23 '15 at 14:29
  • http://stackoverflow.com/q/32974625/1918193 (uses an asm instead of a global, but the result is the same). – Marc Glisse Dec 23 '15 at 15:27
  • @MarcGlisse a very good link. It actually answers my question (2). After using the compiler flag '-fno-optimize-sibling-calls', both programs show the same execution time (the second one slightly slower, as expected). Thanks! – Peter Dec 23 '15 at 21:07
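
For reference, a minimal sketch of the recompile Peter describes in that last comment (-fno-optimize-sibling-calls is a standard GCC flag; the file names match the question, and exact timings will of course vary by machine):

gcc -Ofast -fno-optimize-sibling-calls -o fib1 fib1.c
gcc -Ofast -fno-optimize-sibling-calls -o fib2 fib2.c
time ./fib1
time ./fib2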

3 Answers

1

I believe you hit some very clever and very weird gcc (mis-?)optimization. That's about as far as I got in researching this.

I modified your code to have an #ifdef G around the global:
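
(A sketch of what that modification presumably looks like; guarding the declaration as well as the store is my assumption.)

#include <stdio.h>

#ifdef G
int global;
#endif

int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
#ifdef G
    global = 0;   /* the otherwise meaningless store from fib2.c */
#endif
    return a + b;
}

int main(){
    printf("%d", f1(40));
}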

$ cc -O3 -o foo foo.c && time ./foo
102334155

real    0m0.634s
user    0m0.631s
sys     0m0.001s
$ cc -O3 -DG -o foo foo.c && time ./foo
102334155

real    0m0.365s
user    0m0.362s
sys     0m0.001s

So I have the same weird performance difference.

When in doubt, read the generated assembler.

$ cc -S -O3 -o foo.s foo.c
$ cc -S -DG -O3 -o foog.s foo.c

Here it gets truly bizarre. Normally I can follow gcc-generated code pretty easily. The code that got generated here is just incomprehensible. What should be pretty straightforward recursion and addition fitting in 15-20 instructions, gcc expanded into several hundred instructions with a flurry of shifts, additions, subtractions, compares, branches and a large array on the stack. It looks like it tried to partially convert one or both recursions into an iteration and then unrolled that loop. One thing struck me, though: the non-global function had only one recursive call to itself (the second one is the call from main):

$ grep 'call.*f1' foo.s | wc
      2       4      18

While the global one had:

$ grep 'call.*f1' foog.s | wc
     33      66     297

My educated (I've seen this many times before) guess? Gcc tried to be clever, and in its fervor the function that in theory should be easier to optimize ended up with worse code, while the write to the global variable confused it just enough that it couldn't optimize as hard, which led to better code. This happens all the time: many optimizations that gcc uses (and other compilers too, let's not single them out) are very specific to certain benchmarks and might not generate faster-running code in many other cases. In fact, from experience I only ever use -O2 unless I've benchmarked things very carefully to see that -O3 makes sense. It very often doesn't.

If you really want to research this further, I'd recommend reading the gcc documentation about which optimizations get enabled with -O3 as opposed to -O2 (-O2 doesn't do this), then trying them one by one until you find which one causes this behavior; that optimization should be a pretty good hint for what's going on. I was about to do this, but I ran out of time (must do last-minute Christmas shopping).
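
A minimal sketch of that flag hunt: gcc -Q --help=optimizers dumps which optimizations a given -O level enables, so you can diff the two levels and then rebuild with -O2 plus one differing flag at a time (-finline-functions in the last line is just one example of an -O3-only flag, not necessarily the culprit):

$ gcc -Q -O2 --help=optimizers > o2.txt
$ gcc -Q -O3 --help=optimizers > o3.txt
$ diff o2.txt o3.txt
$ cc -O2 -finline-functions -o foo foo.c && time ./foo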

Art
  • Regarding "-O2 doesn't do this": take a look at my answer, `-O2` just swaps `fib1` and `fib2`. Only `-O0` makes them equal. And they must be equal, in my opinion, because we can safely remove the global variable, since its value is never read. – Roman Zaitsev Dec 23 '15 at 15:31
  • You can ask gcc which optimizations are enabled with `gcc -Q -O3 --help=optimizers` or get a list of differences using `diff`. Possibly easier than sifting through the manual, and has the advantage of corresponding to the actual version of gcc you're using, in case your manual doesn't. – rici Dec 23 '15 at 15:39
  • Thanks Art. It's confusing and worrying that the second program is almost _twice_ as fast. One would expect that '-Ofast' would give more or less the same result for both programs. Actually, I'm tempted to look at this kind of optimization behavior as a bug. – Peter Dec 23 '15 at 18:43
  • @Peter It could be a bug. Or it could be just an optimization that works well in some cases that just doesn't work well here. I don't want to call it a bug, just a secret craft knowledge that gcc -O3 applies some dubious optimizations that sometimes make things much worse. – Art Dec 24 '15 at 01:37
  • @RomanZaytsev On the gcc version I used the only difference between -O2 and -O3 was one instruction writing to the global variable. I didn't want to trust the timing because the differences I saw were pretty close to noise. – Art Dec 24 '15 at 01:38
0

On my machine (gcc (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010) I've got this:
time ./fib1 0,36s user 0,00s system 98% cpu 0,364 total
time ./fib2 0,20s user 0,00s system 98% cpu 0,208 total

From man gcc:

-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
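
For a C program like this one, that boils down to -O3 plus -ffast-math (the two Fortran-specific flags don't apply), so the following builds should behave essentially the same as the -Ofast ones; this is just a sketch, and -ffast-math shouldn't even matter here since the code is integer-only:

gcc -O3 -ffast-math -o fib1 fib1.c
gcc -O3 -ffast-math -o fib2 fib2.c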

Not such a safe option; let's try -O2:
time ./fib1 0,38s user 0,00s system 99% cpu 0,377 total
time ./fib2 0,47s user 0,00s system 99% cpu 0,470 total

I think that some of the aggressive optimizations weren't applied to fib1, but were applied to fib2. When I switched from -Ofast to -O2, some optimizations weren't applied to fib2, but were applied to fib1.

Let's try -O0:
time ./fib1 0,81s user 0,00s system 99% cpu 0,812 total
time ./fib2 0,81s user 0,00s system 99% cpu 0,814 total

They are equal without optimizations.
So introducing a global variable into a recursive function can break some optimizations on the one hand and help other optimizations on the other hand.

Roman Zaitsev
  • I get the same weird behavior with -O3, doesn't have to be -Ofast. – Art Dec 23 '15 at 14:25
  • Yes. I think @Peter can take a look at http://stackoverflow.com/questions/14737371/how-to-find-out-which-optimizations-are-actually-applied-when-using-gcc – Roman Zaitsev Dec 23 '15 at 14:28
0

This results from inline limits kicking in earlier in the second version, because the version with the global variable does more. That strongly suggests that inlining makes run-time performance worse in this particular example.

Compile both versions with -Ofast -fno-inline and the difference in time is gone. In fact, the version without the global variable runs faster.

Alternatively, just mark the function with __attribute__((noinline)).
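
A minimal sketch of that alternative, applied to the function from the question (the attribute is GCC-specific; compile with the same -Ofast command as before):

#include <stdio.h>

/* Ask gcc not to inline f1 into itself or into main, regardless of -O level. */
__attribute__((noinline))
int f1(int n) {
    if (n < 2) {
        return n;
    }
    int a, b;
    a = f1(n - 1);
    b = f1(n - 2);
    return a + b;
}

int main(){
    printf("%d", f1(40));
}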

Maxim Egorushkin
  • Actually, it makes both programs perform worse, though indeed the one without the global variable is faster. But we know how fast the program can actually perform using a (meaningless) global variable. So the question remains, how can the original (best) speed be achieved without adding meaningless global variables? – Peter Dec 23 '15 at 18:33
  • @Peter In my benchmark the non-inline version performs faster. – Maxim Egorushkin Dec 24 '15 at 13:25