8

Here I compile an input program at the -O2 optimization level (with gcc 4.8.4) and measure the execution time:

gcc -O2 test.c -o test
TIMEFORMAT='%3R' && time ./test
execution time = 1.825

and when I replace the -O2 flag with the list of options that the GCC manual for 4.8.4 says are turned on at -O2 (https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/Optimize-Options.html#Optimize-Options), like this:

gcc -fauto-inc-dec -fcompare-elim -fcprop-registers -fdce -fdefer-pop -fdse -fguess-branch-probability -fif-conversion2 -fif-conversion -fipa-pure-const -fipa-profile -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-bit-ccp -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-phiprop -ftree-slsr -ftree-sra -ftree-pta -ftree-ter -funit-at-a-time -fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fdevirtualize -fexpensive-optimizations -fgcse -fgcse-lm -fhoist-adjacent-loads -finline-small-functions -findirect-inlining -fipa-sra -foptimize-sibling-calls -fpartial-inlining -fpeephole2 -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-switch-conversion -ftree-tail-merge -ftree-pre -ftree-vrp test.c -o test
TIMEFORMAT='%3R' && time ./test
execution time = 2.652

My question is: why is the execution time different even though I applied the same optimizations?

UPDATE

If, according to the GCC documentation,

Not all optimizations are controlled directly by a flag.

then how do researchers manage to produce optimization sequences that run even faster than the standard optimization levels? (Using evolutionary algorithms, they generate thousands of candidate sequences and keep those with the highest impact in terms of execution time; see the toy sketch after the examples below.)

As examples, see "Acovea": http://hg.ahs3.net/acovea/debian/html/acoveaga.html

and "Cole": http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf

staticx
  • There may be a more satisfying explanation, but on preemptive multitasking systems (e.g., all desktop, mobile, and server), there's always just the chance that something else was going on. Generally, people tend to run many timing tests and average them out to get more useful numbers that aren't at the mercy of your backup software :). – Linuxios Nov 20 '15 at 17:51
  • 1) Is the code really the same? If not: maybe you had better ask this question on the gcc mailing list. If yes: your benchmark method is wrong. – too honest for this site Nov 20 '15 at 17:52
  • @Olaf If the code is different - why ask on gcc mailing list? – Eugene Sh. Nov 20 '15 at 17:53
  • Code is the same. I already asked the GCC people; I am waiting for the answer – staticx Nov 20 '15 at 17:54
  • @Linuxios: Normally, you can very well take the execution time of the process only. – too honest for this site Nov 20 '15 at 17:54
  • The document says `Depending on the target and how GCC was configured, a slightly different set of optimizations may be enabled at each -O level than those listed here. You can invoke GCC with -Q --help=optimizers to find out the exact set of optimizations that are enabled at each level.` ... have you tried `-Q --help=optimizers`? (See the sketch after these comments.) – Shafik Yaghmour Nov 20 '15 at 17:55
  • Your list of flags doesn't seem to match the one in the docs; for example, you're missing `-fdelayed-branch`. – user2357112 Nov 20 '15 at 17:55
  • @EugeneSh.: Provided the OP added all listed options (I did not check), there seem to be additional optimisations not listed or without a distinct option, so the code differs. All this presumes the measuring method is correct (which should be verified first, of course); otherwise there are two variables (different code and external influences on timing) – too honest for this site Nov 20 '15 at 17:56
  • @user2357112 -fdelayed-branch is for Fortran. I don't need it; that is the only option I eliminated – staticx Nov 20 '15 at 17:56
  • @staticx Try to measure the time of running the same code a few zillions of times instead of just one. It might show that the difference is not that big after all.. – Eugene Sh. Nov 20 '15 at 17:58
  • @Olaf: Could you restate that? I'm not sure I understand. – Linuxios Nov 20 '15 at 17:59
  • In addition to @EugeneSh.: compare the averages. – too honest for this site Nov 20 '15 at 17:59
  • @ShafikYaghmour can you have a look at this topic? We have already discussed invoking GCC with -Q --help=optimizers. It is not so effective and depends on previously executed optimizations: http://stackoverflow.com/questions/33278757/disable-all-optimization-options-in-gcc and http://gcc.1065356.n5.nabble.com/PATCH-clarify-documentation-of-Q-help-optimizers-td1197456.html#a1198555 – staticx Nov 20 '15 at 18:01
  • @staticx: Fortran? I don't think there's anything language-specific about branch delay. Architecture-specific, sure - I don't think delayed branches are a thing on x86 - but maybe it's affecting something anyway. – user2357112 Nov 20 '15 at 18:02
  • @Linuxios: I did not verify `time` right now, but there are basically two times available: the system time and the process time. The latter only counts the time the process is active. I'm currently too lazy to check if OP uses the correct approach. But as the code is identical, I think EugeneSh. has the best point right now (that will also make the difference between the times negligible). – too honest for this site Nov 20 '15 at 18:03
  • @user2357112 Okay, I enabled it and I got the same results, with "warning: this target machine does not have delayed branches" – staticx Nov 20 '15 at 18:04
  • @Olaf: OK! That makes sense. You are right, process time ("CPU time") as opposed to real-world time ("wall clock time") will remove that issue. – Linuxios Nov 20 '15 at 18:05
  • @EugeneSh. I have already done that; my program runs 20 times – staticx Nov 20 '15 at 18:06
  • As the code is the same, your question seems pointless. Your measuring methodology is apparently wrong. Provide a [mcve] – too honest for this site Nov 20 '15 at 18:07
  • It also says `Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed in this section. ` – Shafik Yaghmour Nov 20 '15 at 18:07
  • I am taking wall-clock time with "TIMEFORMAT='%3R'". I don't think it is wrong, since optimizations aim to decrease the execution time – staticx Nov 20 '15 at 18:08
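Following Shafik Yaghmour's suggestion in the comments, here is a minimal sketch for comparing the exact optimization sets (the output file names are made up):

    gcc -Q --help=optimizers -O2 > o2-enabled.txt       # what -O2 really enables on this target
    gcc -Q --help=optimizers     > default-enabled.txt  # the default set, without any -O level
    diff default-enabled.txt o2-enabled.txt             # exactly which flag states -O2 changes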

3 Answers

0

Did you know that the memory layout of a program alone can have a huge performance impact of up to 40%? Here are some rather stupid things that will change the memory layout of your application:

  • Order of linking external libraries at link time
  • Number and contents of environment variables at runtime
  • The name of the directory from which you run the binary

Yes, I'm not kidding here. Just because you run the same binary from location A instead of location B, it may run 40% faster or slower, because the path is pushed onto the stack, and thus its length changes the stack layout, which influences caching, branch prediction, etc.

Most people who measure performance today simply do it incorrectly. Profilers are a far less useful tool than they used to be a couple of years ago, when CPUs still scaled linearly.

Here's a video of a talk about performance that can really open your eyes on that subject and on how to measure it correctly:

https://www.youtube.com/watch?v=r-TLSBdHe1A

I highly recommend that every low-level programmer watch that video. It's 42 minutes and really worth it. It also shows a couple of free tools you can use to measure performance improvements without falling for statistical trickery.

And yes, it directly addresses your question: how do compiler developers know that -O2 is really better than -O1, and do they actually know? In fact, this isn't always the case.

Performance optimization is a fragile thing, and an optimization that makes your app much faster on one system can have the opposite effect on another. So for real optimization you need to work out how likely it is that a change in runtime is just coincidence (because, say, the memory layout changed) rather than a real, actual improvement. Even that topic is briefly covered in the video.

Aside from this video, I can only suggest you take a look at the generated assembly code. If the code is different, that may explain the runtime difference and also provide a hint as to how the optimizations differ. Maybe it all boils down to outdated documentation, and in fact -O2 turns flags on or off that are not mentioned there. And if the code is identical, maybe it's one of those coincidental effects that only appear during linking.

So try adding -S to the gcc call and compare the generated assembly code.
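A minimal sketch of that comparison (reusing the two command lines from the question; the long flag list is abbreviated here with "...", and the .s file names are made up):

    gcc -O2 -S test.c -o with-o2.s
    gcc -fauto-inc-dec -fcompare-elim ... -ftree-vrp -S test.c -o with-flags.s   # full flag list as in the question
    diff with-o2.s with-flags.s    # any difference here is a candidate explanation for the runtime gap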

Mecki
0

Last I checked, passing all the -f options implied by -O2 (but not any -O option) still didn't stop GCC from spilling/reloading values to memory between statements, i.e. the debug-mode behaviour used for consistent debugging. So that part of the default -O0 was still in effect (see: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?). This will be obvious if you look at the asm output (see: How to remove "noise" from GCC/clang assembly output?).

This is of course total garbage for performance. The store/reload latency will be a big bottleneck even for modern CPUs with out-of-order execution. And there won't be any constant-propagation through non-const variables.
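A quick way to spot this (a sketch, assuming an x86-64 target, where debug-style code keeps locals at offsets from %rbp; the flag list is abbreviated with "..."):

    gcc -fauto-inc-dec -fcompare-elim ... -ftree-vrp -S test.c -o flags.s   # full flag list as in the question
    grep -c '(%rbp)' flags.s    # a high count of rbp-relative accesses indicates -O0-style spill/reload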


Marc Glisse commented about this a few years ago, pointing out that the GCC docs say, at the top of 3.11 Options That Control Optimization:

Not all optimizations are controlled directly by a flag. Only optimizations that have a flag are listed in this section.


So if you want GCC to try really hard to optimize your code, always start with -O3 (preferably with -march=native if you only need to run the binary on this machine, and ideally with -flto or similar to enable cross-file inlining).
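For example, a typical aggressive build line might look like this (a sketch reusing the question's test.c; -march=native assumes the binary will only run on the build machine):

    gcc -O3 -march=native -flto test.c -o test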

Then look at adding more -f and/or -m options that aren't already enabled as part of -O3 (e.g. -funroll-loops, which may make things worse for large programs, so it is only enabled by default as part of profile-guided optimization: -fprofile-generate, then run at least once with some representative inputs, then -fprofile-use; see the sketch below).
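The profile-guided workflow mentioned above looks roughly like this (a sketch; the program is assumed to need no special input arguments, as in the question):

    gcc -O3 -fprofile-generate test.c -o test
    ./test                                     # run at least once with representative input to collect profile data
    gcc -O3 -fprofile-use test.c -o test       # rebuild using the collected profile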


Or subtract some options from -O3, or start with -O2 and add some, if that's closer to what you're aiming for, e.g. if -O3's tuning heuristics (such as a greater willingness to unroll) have unwanted effects.

Peter Cordes
  • The statement that follows the one you quote from the GCC docs explains the situation: *Most optimizations are completely disabled at -O0 or if an -O level is not set on the command line, even if individual optimization flags are specified. Similarly, -Og suppresses many optimization passes.* There are three separate pipelines; the `-f` flags cannot enable a pass that is not scheduled in a pipeline at all. – amonakov Apr 30 '23 at 10:00
-3

There are two optimization levels that are good to know:

  • -O2: optimizes for speed (caution: does not initialize variables)
  • -Os: optimizes for size (caution: does not initialize variables)
Gromph