509

I first noticed in 2009 that GCC (at least on my projects and on my machines) has a tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
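
(For reference, a minimal way to reproduce the measurements, assuming the code above is saved as test.cpp; the program only reads the first character of each argument, so the values passed hardly matter for the timing:)

g++ -O2 test.cpp -o test && time ./test 0 0    # ~0.44 s here
g++ -Os test.cpp -o test && time ./test 0 0    # ~0.38 s here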

(Update: I have moved all the assembly code to GitHub: the listings made the post bloated and apparently added very little value to the question, as the -fno-align-* flags have the same effect.)

Here is the generated assembly with -Os and -O2.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os, except the .p2align lines (result here). This code still runs in 0.38s, and the only difference is the .p2align stuff.

If I guess correctly, this padding is for stack alignment. According to Why does GCC pad functions with NOPs?, it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

The noise it introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?


UPDATE:

Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays

So, it pretty much seems like a (mis)alignment issue.

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer 0.37s

  • -O2 -fno-align-functions -fno-align-loops 0.37s

  • -S -O2 then manually moving the assembly of add() after work() 0.37s

  • -O2 0.44s

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could only get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
100.00 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
  1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 79.79 ¦      add    %eax,%ebx

For -fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
 51.59 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
[...]
       ¦    __attribute__((noinline))
       ¦    static int work(int xval, int yval) {
       ¦        int sum(0);
       ¦        for (int i=0; i<LOOP_BOUND; ++i) {
       ¦            int x(xval+sum);
  8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
       ¦            int y(yval+sum);
       ¦            int z = add(x, y);
 35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 39.48 ¦      add    %eax,%ebx
       ¦    }

For -fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
 18.67 ¦     push   %rbp
       ¦       return x + y;
 18.49 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   const int LOOP_BOUND = 200000000;
       ¦
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦     mov    %rsp,%rbp
       ¦       return x + y;
       ¦   }
 12.71 ¦     pop    %rbp
       ¦   ? retq
 [...]
       ¦            int z = add(x, y);
       ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.

I have examined everything that perf can measure via -e on my machine, not just the stats given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

I included the cache misses because they came up in the first comment. I examined all the cache misses that perf can measure on my machine, not just the ones given above. The cache misses are very noisy and show little to no correlation with the execution times.
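
(For reference, the numbers above came from invocations along these lines; a sketch, since the exact event list depends on the machine:)

perf stat -e stalled-cycles-frontend,cache-misses ./a.out 0 0
perf record ./a.out 0 0 && perf annotate    # source-annotated listings as above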

Peter Cordes
Ali
  • Blind guess: can this be a cache miss? –  Oct 19 '13 at 20:46
  • @H2CO3 That was my 1st thought as well, but wasn't encouraged enough to post the comment without reading and understanding the OP's question in depth. – πάντα ῥεῖ Oct 19 '13 at 20:48
  • @g-makulik That's why I warned that it's a "blind guess" ;-) "TL;DR" is reserved for bad questions. :P –  Oct 19 '13 at 20:52
  • Just an interesting data point: I find that -O3 or -Ofast is about 1.5x as fast as -Os when I compile this with clang on OS X. (I haven't tried reproducing with gcc.) – Rob Napier Oct 19 '13 at 21:27
  • It is the same code. Take a closer look at the address of .L3; misaligned branch targets are expensive. – Hans Passant Oct 19 '13 at 21:33
  • @HansPassant If I interpret your comment correctly: That is why it is surprising; it is padded with `-O2` and yet it is slower. Or if I misinterpret, then please elaborate on your comment. – Ali Oct 19 '13 at 21:55
  • @RobNapier Yes, I also tested clang but this question is already longer than I wanted so I simply left that out. By the way, clang generates code that runs in 0.38s if I use `-O3` but with `-O2` or `-Os` it runs significantly slower. – Ali Oct 19 '13 at 21:59
  • @H2CO3 Not all cache misses can be measured on my machine. Stalling on a function call can mean an instruction cache miss. However, counting instruction cache misses is limited on my machine, many of them are not supported. :( – Ali Oct 20 '13 at 14:13
  • Responding to [this comment](http://stackoverflow.com/questions/15665433/c-mysteriously-huge-speedup-from-keeping-one-operand-in-a-register/15665780#comment28888970_15665780): I saw this yesterday (and got my upvote), but I haven't had the time to properly look at it yet. At a glance, Marat Dukhan's answer is pretty in-depth already and is probably what I would've done if I actually sat down and tried to figure it out myself. – Mysticial Oct 20 '13 at 15:31
  • Why the anonymous down votes? It is not helping anybody; please leave a comment at least. – Ali Feb 23 '16 at 15:39
  • @Ali Maybe their mouse positioning was a little bit buggy... :-( – peterh Jul 22 '16 at 04:33
  • @jens: The software is GCC. That the front-end binary is named `g++` is neither here nor there; furthermore, I see no evidence suggesting that the OP was using `g++`. Ultimately the choice of front-end isn't really relevant anyway. – Lightness Races in Orbit Mar 01 '18 at 17:53
  • @LightnessRacesinOrbit I respectfully disagree. The evidence is `const int& y` which is C++ and invalid in C. I also object to your revert of my clarification. `gcc` is the compiler binary, while `GCC` is the GNU Compiler Collection. Your revert reinforces the confusion. The OP has clearly not used gcc to compile the code, but g++. Please reconsider your revert with this in mind. Thank you. – Jens Mar 02 '18 at 17:06
  • @Jens: Fair point regarding `const int& y` - I missed that. However, the question is still about GCC, not about the `g++` command-line wrapper. The proper way to do it would be to write neither "gcc" nor "g++", but "GCC". "gcc" is incorrect only in its Englishness (lack of proper handling for acronyms). I've just made that change. Would you accept the post as it is now? – Lightness Races in Orbit Mar 02 '18 at 17:26
  • @LightnessRacesinOrbit Very well. What about s/gcc/GCC in the title? – Jens Mar 02 '18 at 17:28

6 Answers

579

By default, compilers optimize for an "average" processor. Since different processors favor different instruction sequences, compiler optimizations enabled by -O2 might benefit the average processor but decrease performance on your particular processor (and the same applies to -Os). If you try the same example on different processors, you will find that some of them benefit from -O2 while others are more favorable to -Os optimizations.

Here are the results for time ./test 0 0 on several processors (user time reported):

Processor (System-on-Chip)             Compiler   Time (-O2)  Time (-Os)  Fastest
AMD Opteron 8350                       gcc-4.8.1    0.704s      0.896s      -O2
AMD FX-6300                            gcc-4.8.1    0.392s      0.340s      -Os
AMD E2-1800                            gcc-4.7.2    0.740s      0.832s      -O2
Intel Xeon E5405                       gcc-4.8.1    0.603s      0.804s      -O2
Intel Xeon E5-2603                     gcc-4.4.7    1.121s      1.122s       -
Intel Core i3-3217U                    gcc-4.6.4    0.709s      0.709s       -
Intel Core i3-3217U                    gcc-4.7.3    0.708s      0.822s      -O2
Intel Core i3-3217U                    gcc-4.8.1    0.708s      0.994s      -O2
Intel Core i7-4770K                    gcc-4.8.1    0.296s      0.288s      -Os
Intel Atom 330                         gcc-4.8.1    2.003s      2.007s      -O2
ARM 1176JZF-S (Broadcom BCM2835)       gcc-4.6.3    3.470s      3.480s      -O2
ARM Cortex-A8 (TI OMAP DM3730)         gcc-4.6.3    2.727s      2.727s       -
ARM Cortex-A9 (TI OMAP 4460)           gcc-4.6.3    1.648s      1.648s       -
ARM Cortex-A9 (Samsung Exynos 4412)    gcc-4.6.3    1.250s      1.250s       -
ARM Cortex-A15 (Samsung Exynos 5250)   gcc-4.7.2    0.700s      0.700s       -
Qualcomm Snapdragon APQ8060A           gcc-4.8       1.53s       1.52s      -Os

In some cases you can alleviate the effect of disadvantageous optimizations by asking gcc to optimize for your particular processor (using options -mtune=native or -march=native):

Processor            Compiler   Time (-O2 -mtune=native) Time (-Os -mtune=native)
AMD FX-6300          gcc-4.8.1         0.340s                   0.340s
AMD E2-1800          gcc-4.7.2         0.740s                   0.832s
Intel Xeon E5405     gcc-4.8.1         0.603s                   0.803s
Intel Core i7-4770K  gcc-4.8.1         0.296s                   0.288s

Update: on the Ivy Bridge-based Core i3, three versions of gcc (4.6.4, 4.7.3, and 4.8.1) produce binaries with significantly different performance, but the assembly code has only subtle variations. So far, I have no explanation for this fact.

Assembly from gcc-4.6.4 -Os (executes in 0.709 secs):

00000000004004d2 <_ZL3addRKiS0_.isra.0>:
  4004d2:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004d5:       c3                      ret

00000000004004d6 <_ZL4workii>:
  4004d6:       41 55                   push   r13
  4004d8:       41 89 fd                mov    r13d,edi
  4004db:       41 54                   push   r12
  4004dd:       41 89 f4                mov    r12d,esi
  4004e0:       55                      push   rbp
  4004e1:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  4004e6:       53                      push   rbx
  4004e7:       31 db                   xor    ebx,ebx
  4004e9:       41 8d 34 1c             lea    esi,[r12+rbx*1]
  4004ed:       41 8d 7c 1d 00          lea    edi,[r13+rbx*1+0x0]
  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>
  4004fd:       89 d8                   mov    eax,ebx
  4004ff:       5b                      pop    rbx
  400500:       5d                      pop    rbp
  400501:       41 5c                   pop    r12
  400503:       41 5d                   pop    r13
  400505:       c3                      ret

Assembly from gcc-4.7.3 -Os (executes in 0.822 secs):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

00000000004004fe <_ZL4workii>:
  4004fe:       41 55                   push   r13
  400500:       41 89 f5                mov    r13d,esi
  400503:       41 54                   push   r12
  400505:       41 89 fc                mov    r12d,edi
  400508:       55                      push   rbp
  400509:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  40050e:       53                      push   rbx
  40050f:       31 db                   xor    ebx,ebx
  400511:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400516:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
  40051f:       01 c3                   add    ebx,eax
  400521:       ff cd                   dec    ebp
  400523:       75 ec                   jne    400511 <_ZL4workii+0x13>
  400525:       89 d8                   mov    eax,ebx
  400527:       5b                      pop    rbx
  400528:       5d                      pop    rbp
  400529:       41 5c                   pop    r12
  40052b:       41 5d                   pop    r13
  40052d:       c3                      ret

Assembly from gcc-4.8.1 -Os (executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

0000000000400501 <_ZL4workii>:
  400501:       41 55                   push   r13
  400503:       41 89 f5                mov    r13d,esi
  400506:       41 54                   push   r12
  400508:       41 89 fc                mov    r12d,edi
  40050b:       55                      push   rbp
  40050c:       bd 00 c2 eb 0b          mov    ebp,0xbebc200
  400511:       53                      push   rbx
  400512:       31 db                   xor    ebx,ebx
  400514:       41 8d 74 1d 00          lea    esi,[r13+rbx*1+0x0]
  400519:       41 8d 3c 1c             lea    edi,[r12+rbx*1]
  40051d:       e8 db ff ff ff          call   4004fd <_ZL3addRKiS0_.isra.0>
  400522:       01 c3                   add    ebx,eax
  400524:       ff cd                   dec    ebp
  400526:       75 ec                   jne    400514 <_ZL4workii+0x13>
  400528:       89 d8                   mov    eax,ebx
  40052a:       5b                      pop    rbx
  40052b:       5d                      pop    rbp
  40052c:       41 5c                   pop    r12
  40052e:       41 5d                   pop    r13
  400530:       c3                      ret
newbie
Marat Dukhan
  • Just to make it clear: did you actually go and measure the performance of OP's code on 12 different platforms? (+1 for the mere thought that you would do that) – anatolyg Oct 19 '13 at 22:28
  • @MaratDukhan Thanks! The weird thing is: On my machine `-march=native` makes no difference; I have Intel Core i5-3320M. +1 anyway for your efforts! – Ali Oct 19 '13 at 22:28
  • @anatolyg Yes, I did! (and will add a few more soon) – Marat Dukhan Oct 19 '13 at 22:31
  • Indeed. Another +1 for not only theorizing about different CPUs but actually *proving* it. Not something (alas) you see in every answer concerning speed. Are these tests run with the same OS? (As it might be possible this skews the result...) – Jongware Oct 19 '13 at 22:53
  • @MaratDukhan I am still a little skeptical. If it is not too much to ask, could you please show us the effects of `-O2 -fno-align-functions -fno-align-loops`? At least for those processors where `-mtune=native` seemed to alleviate the issue. Unfortunately, it can happen that `-march=native` only has effect because it influences the alignment. Many thanks! – Ali Oct 19 '13 at 23:01
  • @Ali On AMD-FX 6300 `-O2 -fno-align-functions -fno-align-loops` drops time to `0.340s`, so it could be explained by alignment. However, optimal alignment is processor-dependent: some processors prefer aligned loops and functions. – Marat Dukhan Oct 19 '13 at 23:06
  • @Jongware All tests on Linux (Ubuntu, RHEL, or Debian). – Marat Dukhan Oct 19 '13 at 23:07
  • @Ali It is both, as alignment depends on the `-march` option. – Marat Dukhan Oct 19 '13 at 23:09
  • @Jongware I don't see how the OS would significantly influence the results; the loop never makes system calls. – Ali Oct 19 '13 at 23:09
  • @MaratDukhan True. But why doesn't `-march=native` have any effect on my machine? – Ali Oct 19 '13 at 23:12
  • @Ali I do not know. On Ivy Bridge Core i3 I get better performance from `-O2` with `gcc-4.7.3` and `gcc-4.8.1` and similar performance for `-O2`/`-Os` with `gcc-4.6.4`. – Marat Dukhan Oct 19 '13 at 23:34
  • @MaratDukhan I appreciate your efforts. My problem is: `-O2` is already supposed to do the right alignments. Yet, the correct thing happens if I *disable* alignment. Weird. Please check the effects of `-fno-align-functions -fno-align-loops` with only the `-O2` flag; no need to compare with `-Os`. I would like to see how *disabling* alignment influences the performance on other platforms. – Ali Oct 19 '13 at 23:34
  • @Ali Have a look at the update. It seems that function/loop alignment is irrelevant. – Marat Dukhan Oct 20 '13 at 00:06
  • @MaratDukhan I am stuck. Please try passing `-fno-omit-frame-pointer`. On my machine it causes `-O2` to run in 0.38s and `-Os` in 0.44s, which is the opposite of what happens if I don't pass this flag. I guess something gets aligned in an unfortunate way. I have tried cachegrind; the faster program has *more* cache misses. I would love to know what's going on here... – Ali Oct 20 '13 at 00:12
  • Marat Dukhan, there is Intel IACA to compare asm code behavior on Intel CPUs. – osgx Oct 20 '13 at 03:22
  • @Ali On Core i3-3217U with `gcc-4.8.1 -fno-omit-frame-pointer` I get 0.709s with `-O2` and 0.826s with `-Os`. I don't think it is related to the L1 cache; it is most likely related to the L0 uop cache. I would suggest that performance degrades when one of the instructions on the latency-critical path spans a 16-byte boundary. Probably such instructions cannot be cached in the uop cache. – Marat Dukhan Oct 20 '13 at 05:46
  • @osgx I am afraid IACA simulations are too high-level for this kind of performance issue. Besides that, IACA simulates only linear blocks of code. – Marat Dukhan Oct 20 '13 at 05:47
  • @MaratDukhan OK, I believe I can explain your results as well! Please check my answer. Thanks for your efforts and posts; they helped us find the answer. – Ali Oct 24 '13 at 15:40
  • I just ran a more realistic test: the regression suite for a small programming language: `make tests`. With -O2: 31 seconds; -Os: 41 seconds! And, ironically, the executable is somewhat larger with -Os! Text segment size of 476000 bytes versus 460000 for O2. Oops ... (gcc 4.6.3, Intel Core i5-2310, 2.9 GHz) – Kaz Oct 24 '13 at 22:34
  • You would think, however, that g++ -O3 -march=native would produce code that is optimized for the computer on which it is compiled. I have noticed that sometimes -march=native helps a bit and sometimes it generates code that is slower. I have not investigated. And of course, by now it might have changed, as g++ itself is a moving target. – Dov Jan 18 '21 at 15:23
212

My colleague helped me find a plausible answer to my question. He noticed the importance of the 256-byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


Short answer:

Is it the padding that is the culprit in this case? Why and how?

It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.

I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.
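
For illustration, the .p2align directives gcc emits before a loop typically look like this on x86-64 (a sketch; the exact arguments vary by target and version):

.p2align 4,,10    # align to 2^4 = 16 bytes, but only if at most 10 padding bytes are needed
.p2align 3        # otherwise, fall back to 2^3 = 8-byte alignment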

Of course, inappropriate alignment makes things worse: unnecessary padding eats up bytes for no reason and potentially increases cache misses, etc.

The noise it introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?

Simply by telling gcc to do the right alignment:

g++ -O2 -falign-functions=16 -falign-loops=16


Long answer:

The code will run slower if:

  • an XX-byte boundary cuts add() in the middle (XX being machine-dependent),

  • the call to add() has to jump over an XX-byte boundary and the target is not aligned,

  • add() is not aligned,

  • or the loop is not aligned.

The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In the case of gcc-4.8.1 -Os (which executes in 0.994 secs):

00000000004004fd <_ZL3addRKiS0_.isra.0>:
  4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
  400500:       c3                      ret

a 256-byte boundary cuts add() right in the middle, and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!
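
(To make the boundary arithmetic concrete, here is a tiny helper, purely for illustration; the addresses are the ones from the listing above:)

#include <cstdint>
#include <cstdio>

// Does the byte range [start, start+len) cross an align-byte boundary?
static bool crosses_boundary(std::uint64_t start, std::uint64_t len, std::uint64_t align) {
    return start / align != (start + len - 1) / align;
}

int main() {
    // add() starts at 0x4004fd and is 4 bytes long (3-byte lea + 1-byte ret),
    // so its last byte, the ret, lands at 0x400500 -- past a 256-byte boundary.
    std::printf("%d\n", crosses_boundary(0x4004fd, 4, 256));  // prints 1
}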

In the case of gcc-4.7.3 -Os (which executes in 0.822 secs), the 256-byte boundary only cuts into a cold section (neither the loop nor add() is cut):

00000000004004fa <_ZL3addRKiS0_.isra.0>:
  4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
  4004fd:       c3                      ret

[...]

  40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>

Nothing is aligned, and the call to add() has to jump over the 256-byte boundary. This code is the second slowest.

In the case of gcc-4.6.4 -Os (which executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256-byte boundary and the target is exactly 32 bytes away:

  4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
  4004f7:       01 c3                   add    ebx,eax
  4004f9:       ff cd                   dec    ebp
  4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>

This is the fastest of all three. Why the 256-byte boundary is special on his machine, I will leave it to him to figure out; I don't have such a processor.

Now, on my machine I don't get this 256-byte boundary effect. Only the function and the loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16, then everything is back to normal: I always get the fastest case, and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.

I first noticed in 2009 that gcc (at least on my projects and on my machines) has a tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

A likely explanation is that I had hotspots that were sensitive to alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.
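
(A simple way to do that check, sketched here with a hypothetical app.cpp: build the same code with and without forced alignment and compare the timings.)

g++ -O2 app.cpp -o app-default
g++ -O2 -falign-functions=16 -falign-loops=16 app.cpp -o app-aligned
# if the two binaries consistently differ under time(1), an alignment-sensitive hotspot is likely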

Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of a function as tiny as add() fail?

Consider this:

// add.cpp
int add(const int& x, const int& y) {
    return x + y;
}

and in a separate file:

// main.cpp
int add(const int& x, const int& y);

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

and compiled as: g++ -O2 add.cpp main.cpp.

      gcc won't inline add()!

That's all: it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link-time optimization, the code runs in 0.19s!

(Inlining is artificially disabled in the OP; hence, the code in the OP ran 2x slower.)

Ali
  • Wow... This definitely goes beyond what I usually do to get around benchmarking anomalies. – Mysticial Oct 24 '13 at 16:59
  • @Ali I guess that makes sense, since how can the compiler inline something it doesn't see? That's probably why we use `inline` + function definition in the header. Not sure how mature lto is in gcc. My experience with it, at least in mingw, is hit or miss. – greatwolf Oct 27 '13 at 15:25
  • @greatwolf Yes, *after the fact,* it is clear. :) I believe both the Intel C++ Compiler and the Visual Studio compiler can do link time optimization. In the latter it is called link time code generation, although I haven't used Windows for ages. – Ali Oct 27 '13 at 15:53
  • I think it was Communications of the ACM that had an article a few years ago about running fairly big applications (perl, Spice, etc.) while shifting the whole binary image one byte at a time by using Linux environments of different size. I recall typical variance of 15% or so. Their summary was that many benchmark results are useless because this external variable of alignment is not taken into account. – Gene Oct 27 '13 at 20:40
  • @Gene Interesting, and good to know! If you could dig up that paper, it would be great, thanks! – Ali Oct 27 '13 at 20:42
  • @Ali See my added answer below with a link to a related paper. I think the CACM article was a shortened version of this with a different title. – Gene Oct 27 '13 at 21:08
  • up'd particularly for `-flto`. it's pretty revolutionary if you've never used it before, speaking from experience :) – underscore_d Apr 03 '16 at 23:17
  • This is a fantastic video that talks about how alignment can impact performance and how to profile for it: https://www.youtube.com/watch?time_continue=1&v=r-TLSBdHe1A – Zhro Oct 05 '19 at 08:48
  • @greatwolf Re "how can the compiler inline something it doesn't see": But it does! The command line was "g++ -O2 **add.cpp** main.cpp"... and the gcc manual [says specifically:](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options) " Compiling multiple files at once to a single output file in unit-at-a-time mode **allows the compiler to use information gained from all of the files** when compiling each of them.", and for `-O` that it turns on `-funit-at-a-time`. – Peter - Reinstate Monica Mar 09 '20 at 16:04
85

I'm adding this post-accept to point out that the effects of alignment on the overall performance of programs - including big ones - have been studied. For example, this article (and I believe a version of this also appeared in CACM) shows how link order and OS environment size changes alone were sufficient to shift performance significantly. They attribute this to alignment of "hot loops".

This paper, titled "Producing wrong data without doing anything obviously wrong!" says that inadvertent experimental bias due to nearly uncontrollable differences in program running environments probably renders many benchmark results meaningless.

I think you're encountering a different angle on the same observation.

For performance-critical code, this is a pretty good argument for systems that assess the environment at installation or run time and choose the local best among differently optimized versions of key routines.
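
A minimal sketch of that idea in C++ (the variants and the timing loop here are stand-ins; real self-tuning systems such as ATLAS or FFTW do this far more carefully):

#include <chrono>
#include <cstdio>

using WorkFn = int (*)(int, int);

// Stand-in variants: in a real system these would be the same routine
// built with different flags or for different micro-architectures.
static int work_v1(int x, int y) { return x + y; }
static int work_v2(int x, int y) { return y + x; }

// Time a variant over a fixed workload and return elapsed seconds.
static double time_variant(WorkFn f) {
    const auto t0 = std::chrono::steady_clock::now();
    volatile int sink = 0;
    for (int i = 0; i < 10000000; ++i) sink = sink + f(i, i);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Pick the locally fastest variant once, e.g. at startup or installation.
static WorkFn select_fastest() {
    WorkFn candidates[] = { work_v1, work_v2 };
    WorkFn best = candidates[0];
    double best_time = time_variant(best);
    for (WorkFn f : candidates) {
        const double t = time_variant(f);
        if (t < best_time) { best_time = t; best = f; }
    }
    return best;
}

int main() {
    const WorkFn work = select_fastest();  // decided once, at startup
    std::printf("%d\n", work(2, 3));
}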

Gene
36

I think that you can obtain the same result as what you did:

I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines:

… by using -O2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1. I have been compiling everything with these options for 15 years; they were faster than plain -O2 every time I bothered to measure.

Also, for a completely different context (including a different compiler), I noticed that the situation is similar: the option that is supposed to “optimize code size rather than speed” optimizes for code size and speed.

If I guess correctly, this padding is for stack alignment.

No, this has nothing to do with the stack; the NOPs that are generated by default, and that the -falign-*=1 options prevent, are for code alignment.

According to Why does GCC pad functions with NOPs?, it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.

Is it the padding that is the culprit in this case? Why and how?

It is very likely that the padding is the culprit. The reason padding is felt to be necessary, and is useful in some cases, is that code is typically fetched in lines of 16 bytes (see Agner Fog's optimization resources for the details, which vary by model of processor). Aligning a function, loop, or label on a 16-byte boundary statistically increases the chances that one fewer line will be necessary to contain it. Obviously, it backfires because these NOPs reduce code density and therefore cache efficiency. In the case of loops and labels, the NOPs may even need to be executed once (when execution arrives at the loop/label normally, as opposed to from a jump).
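
For a concrete instance, take the gcc-4.8.1 add() from Marat Dukhan's answer above: the function is only 4 bytes long, yet it straddles two 16-byte fetch lines:

fetch line 0x4004f0-0x4004ff:  ... 8d 04 37    (lea, the first 3 bytes of add)
fetch line 0x400500-0x40050f:  c3 ...          (ret, the last byte of add)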

Pascal Cuoq
  • The funny thing is: `-O2 -fno-omit-frame-pointer` is just as good as `-Os`. Please check the updated question. – Ali Oct 20 '13 at 14:12
  • According to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, all your flags are enabled at both `-O2`, `-O3`. – xamid Sep 24 '20 at 05:27
  • @xamid Saying that `-O2` enables `-falign-functions -falign-jumps -falign-labels -falign-loops` is meaningless. Each of these options takes a numerical argument. The argument they receive for most target architectures when they are enabled automatically by `-O2` is **not** `1`. Setting them to `1` effectively **disables** these “optimizations” that often make the binary code slower. – Pascal Cuoq Sep 24 '20 at 12:18
  • @PascalCuoq It reads for example "-fno-align-functions and -falign-functions=1 are equivalent and mean that functions are not aligned". But I just noticed that they probably mean just the opposite by the following "Enabled at levels -O2, -O3." So I guess my previous comment is _false_, not meaningless. I will check your flags, thanks. :-) – xamid Sep 24 '20 at 13:26
  • @xamid For the record I only meant that the GCC documentation was meaningless. It is right to quote documentation that says that “-falign-functions is enabled as -O2” below this answer. It is wrong for the documentation to say that “-falign-functions is enabled” when in a sense, it is always enabled, just with a different value. – Pascal Cuoq Sep 24 '20 at 19:01
  • I would like to mention that now that I used `-O3 -fno-align-functions -fno-align-jumps -fno-align-loops -fno-align-labels` instead of only `-O3`, my app actually runs faster, and it also decreased the executable file size. – xamid Sep 26 '20 at 08:47
13

If your program is bounded by the L1 code cache, then optimizing for size suddenly starts to pay off.

When I last checked, the compiler was not smart enough to figure this out in all cases.

In your case, -O3 probably generates enough code for two cache lines, while -Os fits in one cache line.
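
One way to probe that hypothesis is perf's instruction-cache counter (a sketch; the event name and its availability vary by kernel and CPU):

perf stat -e L1-icache-load-misses ./a.out 0 0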

Joshua
  • How much you wanna bet those align= parameters relate to the size of cache lines? – Joshua Oct 24 '13 at 16:32
  • I don't really care anymore: It's not visible on my machine. And by passing the `-falign-*=16` flags, everything is back to normal, everything behaves consistently. As far as I am concerned, this question is resolved. – Ali Oct 24 '13 at 21:12
8

I'm by no means an expert in this area, but I seem to remember that modern processors are quite sensitive when it comes to branch prediction. The algorithms used to predict branches are (or at least were back in the days when I wrote assembler code) based on several properties of the code, including the distance to the target and the direction.

The scenario that comes to mind is small loops. When a branch goes backwards and the distance is not too far, branch prediction optimizes for this case, as all small loops work this way. The same rules might come into play when you swap the locations of add and work in the generated code, or when the positions of both change slightly.

That said, I have no idea how to verify that and I just wanted to let you know that this might be something you want to look into.
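
(If nothing else, perf's branch counters would show whether mispredictions actually differ between the fast and slow binaries; a sketch:)

perf stat -e branches,branch-misses ./a.out 0 0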

Daniel Frey
  • Thanks. I played with it: I only get a speed-up by swapping `add()` and `work()` if `-O2` is passed. In all other cases the code gets significantly slower by swapping. Over the weekend, I also analyzed branch prediction / misprediction statistics with `perf`, and I did not notice anything that could explain this weird behavior. The only consistent result is that in the slow case `perf` reports 100.0 in `add()` and a large value on the line right after the call to `add()` in the loop. It looks like we are stalling for some reason on `add()` in the slow case but not in the fast runs. – Ali Oct 22 '13 at 20:21
  • I am thinking about installing Intel's VTune on one of my machines and doing some profiling myself. `perf` supports only a limited number of things; perhaps Intel's stuff is a bit more handy on their own processors. – Ali Oct 22 '13 at 20:29