Since you compiled in debug mode, your loop probably stores/reloads `k` to memory every iteration (and the same for the loop counter). Tiny loops in un-optimized code usually bottleneck on store-forwarding latency (~5 cycles on Intel Haswell), not on any kind of throughput limit.
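To make that concrete, here's a minimal sketch of the kind of loop where a debug build bottlenecks on that store-forwarding round trip. The bound and body are placeholders, not your actual code:

```cpp
// Minimal sketch, not the code from the question: placeholder bound and body.
// Compiled without optimization (MSVC /Od, gcc/clang -O0), every access to k and
// i goes through a stack slot: load, add, store.  The next iteration's reload of
// the same slot has to wait for store-forwarding from the previous store
// (~5 cycles on Haswell), so latency, not ALU throughput, limits the loop.
long long debug_mode_bottleneck(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i) {
        k += 1;   // unoptimized: reload k from the stack, add, store it back
        k += 2;   // another full memory round trip chained onto the same slot
    }
    return k;
}
```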
> My guess is that CPU is performing some kind of heuristics during the branch prediction and that in my specific case my specific CPU was better at predicting this branch in test1.
That doesn't make sense. Perfect branch prediction can reduce the cost of a branch to zero, not make it negative. I.e. doing two ADD instructions + a branch can't be faster than just doing two ADD insns, unless there's something else different as well.
Presumably MSVC in debug mode ends up making less-bad code for test1 than for test2. Perhaps it keeps `k` in a register across one of the increments, instead of doing both increments with a memory destination?
If you posted the asm, it would be possible to tell you what the compiler did differently that made the first version run faster, but it wouldn't be useful or interesting. (And it has nothing to do with branch prediction; the branch should predict with at least 99.999% accuracy, so the actual branch instruction is nearly free.)
Read Agner Fog's microarch pdf if you want to learn how CPUs work internally, and how to predict how fast a loop of assembly instructions will run.
See my answer to another question, about optimizing for debug mode, for why it's meaningless and a waste of time, and not useful for learning what's fast and what's slow.
Optimization doesn't just speed everything up by a constant factor; it changes what kind of bottlenecks you will experience. (E.g. `k` will live in a register, so `k += 1` only has the latency of an ADD instruction, not of a round trip through memory.) And of course combining the increments into a `k += 3`, proving that the `i != 0` branch only goes the other way once, and peeling that iteration out.
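As a source-level sketch of those transformations (this assumes a test1 body along the lines of `if (i != 0) k += 2; k += 1;`, which is my reconstruction from the details in this answer, not the question's exact code):

```cpp
// Hypothetical "as written" version, guessed from the k+=3 / i!=0 description:
long long test1_as_written(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i) {
        if (i != 0) k += 2;   // taken on every iteration except the first
        k += 1;
    }
    return k;
}

// What the optimizer can effectively turn it into: k lives in a register, the
// two increments fuse into k += 3, and the i == 0 iteration is peeled off.
long long test1_peeled(long long n) {
    if (n <= 0) return 0;
    long long k = 1;          // peeled i == 0 iteration: only the += 1 ran
    for (long long i = 1; i < n; ++i)
        k += 3;
    return k;                 // == 3*n - 2 for n >= 1; a constant if n is one
}
```

If that reconstruction is right, the `movabs` constant mentioned below is just 3*n - 2 for the question's loop bound.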
More importantly, the result is a compile-time constant, so ideally both functions will compile to a single instruction that puts the result into the return-value register. Unfortunately, gcc6.2, clang3.9, and icc17 all fail to do that for version 1, so it looks like manually peeling out the first iteration is very helpful.
But for test2, all compilers have an easy time compiling it to `movabs rax, 29999999998` / `ret`.
Related: What optimizing compilers do with these functions
If you want to look at compiler output, use function args instead of constants. You did already return the result, so that's good.
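For example (hypothetical signatures and bodies, just to show the pattern):

```cpp
// With a hard-coded trip count, everything is visible at compile time, so a good
// compiler can fold the whole function down to "movabs rax, <constant> / ret"
// and there's nothing left to study:
long long test2_const_bound() {
    long long k = 0;
    for (long long i = 0; i < 1000000000LL; ++i)   // placeholder bound
        k += 3;
    return k;
}

// With the bound as a function argument, the loop (or at least a closed-form
// 3*n) has to survive into the generated code, which is what you want to look at:
long long test2_arg_bound(long long n) {
    long long k = 0;
    for (long long i = 0; i < n; ++i)
        k += 3;
    return k;
}
```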
I put your functions up on the Godbolt Compiler explorer (which recently added a newer version of Intel's compiler than icc13).
clang-3.9 foolishly uses `cmov` for test1, from -O1 to -O3. A branch that predictable would be better done with a jump, if it's not peeled entirely. (Perhaps an unsigned loop counter / upper bound would help the compiler figure out what's going on, but I didn't try.)
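A sketch of that untested idea, assuming test1 looks roughly like the reconstruction above:

```cpp
// Untested: with an unsigned counter, i != 0 is equivalent to i >= 1, which might
// make it more obvious to the compiler that the branch goes one way only on the
// first iteration, so it can peel that iteration instead of using cmov in the loop.
long long test1_unsigned(unsigned long long n) {
    long long k = 0;
    for (unsigned long long i = 0; i < n; ++i) {
        if (i != 0) k += 2;
        k += 1;
    }
    return k;
}
```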
gcc uses a conditional branch for test1, and a function-arg counter version of test2. The branching looks reasonable, with the `i != 0` case being the fall-through not-taken side of the CMP/JE. But it still actually counts the loop counter up to the max, as well as incrementing `k`.
In test2 gcc just runs up the loop counter and multiplies it outside the loop.
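A rough source-level paraphrase of that strategy (my reading of the asm, not gcc's actual output, assuming a branch-free body that adds 3 per iteration as sketched earlier):

```cpp
// Paraphrase of gcc's approach for the function-arg test2: the loop still runs to
// count iterations, but the repeated adds become one multiply after the loop.
long long test2_gcc_paraphrase(long long n) {
    long long i = 0;
    while (i < n)
        ++i;            // just run the counter up to n
    return i * 3;       // scale once, outside the loop
}
```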
ICC17 unrolls test1 pretty aggressively, with multiple integer registers as accumulators so more than one ADD or INC can happen in parallel. IDK if it helps, because it spills one of its accumulators, with a memory-destination ADD. Actually I think it's unrolled enough that the ~20-uop loop is only slowed down from 5 to 6 cycles per iteration by the bottleneck on a memory-destination ADD (on Intel Haswell), instead of the bottleneck on integer ALU ADD/INC instructions (4 per clock on Haswell). It does the `+=2` and `+=1` in different registers, and combines them at the end (I assume; I just looked at the loop with the `cmp rdx, 1249999999` end condition). It's amusing how aggressively it manages to optimize the loop without just optimizing it down to a constant like it does for test2.
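For reference, here's the multiple-accumulator idea in source form. This is a paraphrase of the technique, not ICC's actual output, and it uses a simplified branch-free body:

```cpp
// Simplified illustration of multiple-accumulator unrolling: four independent add
// chains can execute in parallel (up to 4 scalar ADDs per clock on Haswell),
// instead of one serial dependency chain through a single k.
long long sum3_unrolled(long long n) {
    long long k0 = 0, k1 = 0, k2 = 0, k3 = 0;
    long long i = 0;
    for (; i + 3 < n; i += 4) {
        k0 += 3;
        k1 += 3;
        k2 += 3;
        k3 += 3;
    }
    long long k = k0 + k1 + k2 + k3;   // combine the accumulators once at the end
    for (; i < n; ++i)
        k += 3;                        // handle the leftover iterations
    return k;
}
```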