gcc optimization flag -O3 makes code slower than -O2

Question

I found the topic "Why is it faster to process a sorted array than an unsorted array?" and tried running its code, and I noticed strange behavior. Compiled with the -O3 optimization flag, it takes 2.98605 sec to run. Compiled with -O2, it takes 1.98093 sec. I ran the code several times (5 or 6) on the same machine in the same environment, with all other software (Chrome, Skype, etc.) closed.

gcc --version
gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Can you please explain why this happens? I read the gcc manual and saw that -O3 includes -O2. Thank you for your help.

P.S. Added the code:

#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}

Did you benchmark several times your program? What is your exact processor? What exact code do you have? Did you try to compile with `gcc -O3 -mtune=native` ? And be sure to run *several times* a program which lasts a few seconds (not centiseconds). — Basile Starynkevitch, Mar 05 '15 at 10:20
Did you run each program once? You should try a few times. Also make sure *nothing* else is running on the machine you use for benchmarking, — doctorlove, Mar 05 '15 at 10:21
@BasileStarynkevitch i add code. I try several times and have same results. I try to compile with `-mtune=native` - same result as before(without this flag). Processor - Intel Core i5 -2400 — Mike Minaev, Mar 05 '15 at 10:24
I just experimented a little bit and added to `O2` additional optimizations that `O3` performs one at a time. The additional optimization flags that O3 adds for me are: `-fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops`. I found that adding `-ftree-vectorize` as optimization flag to O2 is the one that has the negative impact. I'm on Windows 7 with mingw-gcc 4.7.2. — halex, Mar 05 '15 at 10:45
@halex that sounds more like a potential answer than a comment — doctorlove, Mar 05 '15 at 11:05
@doctorlove I can't explain why it is slower with autovectorization of loops so i thought it's too little information for an answer :) — halex, Mar 05 '15 at 11:11
Another interesting observation is that according to the output of `-ftree-vectorizer-verbose=2` no loop is vectorized, so i don't understand why it has such a negative impact on the runtime. — halex, Mar 05 '15 at 11:25
Changing the variable `sum` from a local to a global or static one makes the difference between O2 and O3 vanish. The problem seems to be related to lots of stack operations to store and retrieve the variable `sum` inside the loop if it's local. My knowledge of Assembly is too limited to fully understand the generated code by gcc:) — halex, Mar 05 '15 at 13:18

Peter Cordes · Accepted Answer · 2017-10-23T01:30:51.230

gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the x86 tag wiki). This is one of the cases where cmov sucks.

If the data was even moderately unpredictable, cmov would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)

I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).

.L82:  # the inner loop from gcc -O3
    movsx   rcx, DWORD PTR [rdx]  # sign-extending load of data[c]
    mov     rsi, rcx
    add     rcx, rbx        # rcx = sum+data[c]
    cmp     esi, 127
    cmovg   rbx, rcx        # sum = data[c]>127 ? rcx : sum
    add     rdx, 4          # pointer-increment
    cmp     r12, rdx
    jne     .L82

gcc could have saved the MOV by using LEA instead of ADD.

The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.

The loop only contains 8 fused-domain uops, so it can issue at one per 2 cycles. Execution-port pressure is also not as bad a bottleneck as the latency of the sum dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).

BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0); to take the cmov out of the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov in each iteration is independent. This compiles as expected in gcc6.3 -O2 and earlier, but gcc7 de-optimizes into a cmov on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if() way of writing it.)

Clang takes the cmov off the critical path even with the original source.
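To make the two source-level spellings concrete, here is a minimal sketch of both as standalone functions. The function names are made up for illustration, and which one ends up with a cmov on the critical path depends on the compiler version and flags, as described above. The only thing the code itself guarantees is that both compute the same sum.

```cpp
#include <cassert>
#include <cstddef>

// The original form: a conditional update of sum.
// A compiler may turn this into a branch, or into a cmov that selects
// between sum and sum+data[c] (putting the cmov in the sum dep chain).
long long sum_with_if(const int* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c) {
        if (data[c] >= 128)
            sum += data[c];
    }
    return sum;
}

// The rewritten form: select the addend first, then add unconditionally.
// If the compiler uses a cmov here, it only depends on data[c], so the
// loop-carried chain through sum can be just the add.
long long sum_with_select(const int* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}
```

Checking the asm (e.g. on Godbolt) is still the only way to know what a given gcc or clang version actually does with either form.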

gcc -O2 uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add (1 cycle latency).

The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.

.L83:   # The inner loop from gcc -O2
    movsx   rcx, DWORD PTR [rdx]  # load with sign-extension from int32 to int64
    cmp     ecx, 127
    jle     .L82        # conditional-jump over the next instruction 
    add     rbp, rcx    # sum+=data[c]
.L82:
    add     rdx, 4
    cmp     rbx, rdx
    jne     .L83

There are two loop-carried dependency chains: sum and the loop-counter. sum is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.

It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).

This pipeline-analysis prediction matches pretty much exactly with your timings of ~3 sec for -O3 vs. ~2 sec for -O2.
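As a rough sanity check on that match, each measured time can be turned into the clock speed it would imply under the estimated cycles per iteration. This is only a sketch: the helper name `implied_clock_hz` and the constant `kIterations` are made up for illustration, and the 3 and 2 cycles per iteration are the estimates from the analysis above, not measurements.

```cpp
#include <cassert>

// Total inner-loop iterations in the question's benchmark:
// 100000 outer iterations * 32768 array elements.
constexpr double kIterations = 100000.0 * 32768.0;

// Clock frequency (in Hz) implied by a measured run time, assuming every
// inner-loop iteration costs cycles_per_iter cycles and nothing else matters.
double implied_clock_hz(double cycles_per_iter, double seconds)
{
    assert(seconds > 0.0);
    return cycles_per_iter * kIterations / seconds;
}
```

With the question's timings, `implied_clock_hz(3, 2.98605)` and `implied_clock_hz(2, 1.98093)` both come out at roughly 3.3 GHz, so the two per-iteration estimates are consistent with the same clock speed.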

Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).

(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)

g++6.3.1 uses cmov even at -O2, but g++5.4 still behaves like 4.9.2.

With both g++6.3.1 and g++5.4, using -fprofile-generate / -fprofile-use produces the branchy version even at -O3 (with -fno-tree-vectorize).

The CMOV version of the loop from newer gcc uses add ecx,-128 / cmovge rbx,rdx instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.

Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version in 0.74s, so about twice as fast as scalar. Or -O3 -march=native in 0.35s, using AVX2 256b vectors).

The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
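For illustration, here is a hand-written SSE2 sketch of the same idea: compare, mask, then widen by zero-extension. This is not gcc's actual output. The function name is hypothetical, it widens to 64-bit on every iteration for simplicity, and it falls back to plain scalar code when SSE2 isn't available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Sum of all elements >= 128 (hypothetical helper, not from gcc's output).
long long sum_ge128_simd(const int* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    long long sum = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i threshold = _mm_set1_epi32(127);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // two 64-bit partial sums
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        __m128i mask = _mm_cmpgt_epi32(v, threshold);  // pcmpgtd: all-ones where v > 127
        __m128i kept = _mm_and_si128(v, mask);         // rejected elements become 0
        // Everything left in kept is non-negative, so interleaving with zero
        // (zero-extension) is enough; no second compare for sign-extension.
        acc = _mm_add_epi64(acc, _mm_unpacklo_epi32(kept, zero));
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi32(kept, zero));
    }
    std::int64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    sum = lanes[0] + lanes[1];
#endif
    // Scalar tail (or the whole array when SSE2 is unavailable).
    for (; i < n; ++i)
        sum += (data[i] >= 128 ? data[i] : 0);
    return sum;
}
```

The only loop-carried dependency here is through `acc` (two paddq per iteration). The load, compare, AND and unpacks of each iteration are independent of the previous one.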

BTW, I saw this question ages ago, probably when it was first posted, but I guess got side-tracked from answering it until now (when I was reminded of it). — Peter Cordes, May 12 '17 at 15:53
Do `-fprofile-generate` and `-fprofile-use` help in this case? — Marc Glisse, May 12 '17 at 16:55
@MarcGlisse: Just tested: yes, g++5.4 and g++6.3.1 make the same branchy code with `-O3 -fno-tree-vectorize -fprofile-use`. (Even though without PGO, g++6.3.1 uses CMOV even at `-O2`). On 3.9GHz Skylake, the CMOV version runs in 1.68s, while the branchy version runs in 1.45s, so the difference is much smaller with efficient CMOV. — Peter Cordes, May 12 '17 at 17:22
@MarcGlisse: updated the answer with more stuff. Why is newer gcc using `add ecx, -128` instead of a CMP? Is that just for code-size reasons (since -128 fits in a sign-extended imm8)? I guess that's probably worth writing ecx for no reason, since it's dead at that point and OOO execution can free it soon. I'm surprised it still doesn't use LEA to compute `sum+data[c]` in a different register to avoid the MOV, though. — Peter Cordes, May 12 '17 at 17:38
A lot of it seems to be tuning choices, playing with `-mtune=...` changes add to cmp. No idea about lea. On a skylake laptop, -O3 code is significantly faster than -O2 code. — Marc Glisse, May 13 '17 at 10:12
Random question but do you know how to get MSVC to stop emitting CMOV for a branch in a tight loop on x64? Sometimes a small change works but sometimes I've really had trouble finding any way to coax it to. — user541686, Oct 17 '21 at 18:08
@user541686: IDK, I don't use MSVC, and don't generally tune code to get it to make good asm. [Generating CMOV instructions using Microsoft compilers](https://stackoverflow.com/q/13661285) is the opposite problem, although some of the answer there is about an older MSVC version that rarely used cmov. — Peter Cordes, Oct 17 '21 at 19:40
