
TL;DR: the first loop runs ~18% faster on a Haswell CPU. Why? The loops come from gcc -O0 (un-optimized) compilation of ptr++ vs ++ptr, but the question is why the resulting asm performs differently, not anything about how to write better C.


Let's say we have these two loops:

    movl    $0, -48(%ebp)     //Loop counter set to 0
    movl    $_data, -12(%ebp) //Pointer to the data array
    movl    %eax, -96(%ebp)
    movl    %edx, -92(%ebp)
    jmp L21
L22:
    // ptr++
    movl    -12(%ebp), %eax   //Get the current address
    leal    4(%eax), %edx     //Calculate the next address
    movl    %edx, -12(%ebp)   //Store the new (next) address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx   //Get the loop counter to edx
    movl    %edx, (%eax)      //Move the loop counter value to the CURRENT address, note -12(%ebp) contains already the next one
    addl    $1, -48(%ebp)     //Increase the counter
L21:
    cmpl    $999999, -48(%ebp)
    jle     L22

and the second one:

    movl    %eax, -104(%ebp)
    movl    %edx, -100(%ebp)
    movl    $_data-4, -12(%ebp) //Address of the data - 1 element (4 byte)
    movl    $0, -48(%ebp)       //Set the loop counter to 0
    jmp L23
L24:
    // ++ptr
    addl    $4, -12(%ebp)       //Calculate the CURRENT address by adding one sizeof(int)==4 bytes
    movl    -12(%ebp), %eax     //Store in eax the address
    // rest of the loop is the same as the other
    movl    -48(%ebp), %edx     //Store in edx the current loop counter
    movl    %edx, (%eax)        //Move the loop counter value to the current stored address location
    addl    $1, -48(%ebp)       //Increase the loop counter
L23:
    cmpl    $999999, -48(%ebp)
    jle L24

These loops do exactly the same thing, just in a slightly different manner; please refer to the comments for the details.

This asm code is generated from the following two C++ loops:

    //FIRST LOOP:
    for(;index<size;index++){
        *(ptr++) = index;
    }
    //SECOND LOOP:
    ptr = data - 1;
    for(index = 0;index<size;index++){
        *(++ptr) = index;
    }

Now, the first loop is about ~18% faster than the second one: no matter in which order the loops are executed, the one with ptr++ is faster than the one with ++ptr.

To run my benchmarks I collected the running time of those loops for different sizes, executing both of them nested inside other loops to repeat the operation frequently.
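A minimal sketch of that kind of harness (illustrative only, not my exact benchmark code: the array size, the repeat count and the use of `clock()` are assumptions, and judging by the asm it was a 32-bit build, e.g. `g++ -O0 -m32`):

    #include <cstdio>
    #include <ctime>

    static const int size = 1000000;   // 1M ints = 4MiB of data
    static int data[size];
    static const int repeat = 100;     // repeat each loop to get a measurable time

    int main()
    {
        int* ptr;
        int index;

        // FIRST LOOP: ptr++
        std::clock_t t0 = std::clock();
        for (int r = 0; r < repeat; ++r) {
            ptr = data;
            for (index = 0; index < size; index++) {
                *(ptr++) = index;
            }
        }
        std::clock_t t1 = std::clock();

        // SECOND LOOP: ++ptr
        for (int r = 0; r < repeat; ++r) {
            ptr = data - 1;            // technically undefined behaviour, see the comments
            for (index = 0; index < size; index++) {
                *(++ptr) = index;
            }
        }
        std::clock_t t2 = std::clock();

        std::printf("ptr++ : %ld clocks\n++ptr : %ld clocks\n",
                    (long)(t1 - t0), (long)(t2 - t1));
        return 0;
    }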


ASM analysis

Looking at the ASM code, the second loop contains fewer instructions: we have 3 movl and 2 addl, whereas in the first loop we have 4 movl, one addl and one leal. So we have one movl more, and a leal instead of an addl.

Is it correct that the LEA operation for computing the correct address is much faster than the ADD (+4) method? Is this the reason for the difference in performance?

As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop after the addl $4,-12(%ebp) needs to wait a little before proceeding, whereas in the first loop we can immediately refer to the memory and in the meanwhile LEAL will compute the next address (some kind of better pipeline performance here).

Is there some reordering going on here? I'm not sure about my explanation for the performance difference of those loops, can I have your opinion?

  • Well, just for my curiosity, can you try a reordered second loop? (init index to -1, move addl $4 right ahead of addl $1, and move L23: to ahead of addl $4 (two instructions sooner)). I would expect that to fix the difference (although both codes look very wastefully complex, using a lot of temporary memory instead of registers, etc., so I suppose this is -O0). – Ped7g May 16 '16 at 09:29
  • what microarchitecture are you testing on? Haswell? Sandybridge? – Peter Cordes May 16 '16 at 09:36
  • Note that the second loop may be undefined behavior, due to `data - 1` underflowing. – EOF May 16 '16 at 09:37
  • @EOF: can you elaborate on that? For me it looks like ordinary pointer math ... so ptr-1+1 = ptr+0, all looks fine to me. – Ped7g May 16 '16 at 09:39
  • @Ped7g: C++ draft standard N3337: *5.7 Additive operators [...] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.* While there is a special case for pointing *past* the last element of the array, there is no such exception for pointing *before* the first element. – EOF May 16 '16 at 09:43
  • @EOF: It's also undefined behaviour to create an unaligned pointer with a cast, even if you don't dereference. However, we do that all the time with `__m128i` types. The semantics *are* well defined for this on most C implementations, including gcc targeting x86. Interesting point that the OP's idiom is technically undefined behaviour, though. Worth avoiding if there's an alternative that's just as good. – Peter Cordes May 16 '16 at 09:47
  • @PeterCordes: Regrettably I'll have to disagree with you on this. Just because a compiler doesn't take advantage of this *yet* doesn't mean the behavior is defined. I would not be surprised if gcc at some point started to use the fact that a pointer can't point before the array, if they can emit faster code for it. – EOF May 16 '16 at 09:51
  • @EOF: I see your point, but I don't think it's plausible for gcc to take advantage of this for anything useful. Most modern code don't use much static data, and a function that gets a pointer arg doesn't know if it's a pointer into the middle of an array or not. So there's not much incentive to try to develop optimizations based on knowing the object boundaries. I know that "I can't think of any" isn't a very convincing argument that there aren't any optimizations to be had here, but I think that's true even for cases where gcc does know object boundaries. – Peter Cordes May 16 '16 at 10:18
  • @EOF: gcc is required to keep correctly compiling SSE intrinsics code that creates unaligned pointers to `__m128i` objects, because doing that is well defined on x86 (and C targeting x86), even if it's not defined in the C abstract machine. [There was some real benefit to be gained from assuming signed overflow never happens; e.g. in optimizing `for` loops with `int` counters that index array. Instead of sign-extending to 64bits every time, just use a 64bit reg.](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html), so that justified breaking code with that undef behaviour. – Peter Cordes May 16 '16 at 10:23
  • I had to edit the question to change my downvote to an upvote. I didn't really take the time to read it the first time. Anyway, I put the key information in places people will see it when skimming, which is essential when you have a big post. Most people will skim to see if it's interesting before slowing down to really read stuff. It's actually a much better question than most we get about un-optimized code. Nice job analysing the latency difference, that's a great start. – Peter Cordes May 16 '16 at 10:42
  • 18% speed difference in a nearly empty loop only in unoptimized code? I fail to see a practical case for this being an issue: did it arise from anything but curiosity? – Yakk - Adam Nevraumont May 16 '16 at 10:55
  • (Did not read the answer or other comments, just writing my gut reaction here) My guess would be that since `ptr++` yields the value **first**, and increments **second**, the CPU could execute the assignment and the increment at the same time, as the assignment does not depend on the value after the increment operation. But `++ptr` increments **first**, and **only then** it yields the value back, which means that the memory location that you're dereferencing for writing would actually depend on the increment operation, so they can't be performed simultaneously, which leads to a halt. – notadam May 16 '16 at 11:01
  • @Yakk: Of course nobody cares about the speed of unoptimized code in general. I should prob. edit the question title back to being about the asm, not the C, since the question is really about the asm that `gcc -O0` happens to generate. **You won't learn anything about how to write better / more efficient C, but you can learn something about the pipeline of modern x86 microarchitectures.** It turns out that the answer is probably resource conflicts. – Peter Cordes May 16 '16 at 11:55
  • @peter well, sometimes you do care: if debug is too slow, it can make your code unusable, and/or prevent reproducing bugs in debug. I was just questioning 18% being enough to matter. ;) – Yakk - Adam Nevraumont May 16 '16 at 13:30

1 Answer


First of all, performance analysis on -O0 compiler output is usually not very interesting or useful.


Is it correct that the LEAL operation for computing the correct address is much faster than the ADDL (+4) method? Is this the reason for the difference in performance?

Nope, add can run on every ALU execution port on any x86 CPU. lea usually has equally low latency for simple addressing modes, but not as good throughput. On Atom, it runs in a different stage of the pipeline from normal ALU instructions, because it actually lives up to its name and uses the AGU on that in-order microarchitecture.

See the tag wiki to learn what makes code slow or fast on different microarchitectures, esp. Agner Fog's microarchitecture pdf and instruction tables.

add is only worse because it lets gcc -O0 make even worse code by using it with a memory destination and then loading from that.


Compiling with -O0 doesn't even try to use the best instructions for the job. e.g. you'll get mov $0, %eax instead of the xor %eax,%eax you always get in optimized code. You shouldn't infer anything about what's good from looking at un-optimized compiler output.
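For instance (a trivial example to illustrate; the exact instruction sequences depend on the gcc version and target):

    int zero() { return 0; }
    // gcc -O2 typically compiles this to just:
    //     xorl    %eax, %eax
    //     ret
    // gcc -O0 (32-bit) typically produces something like:
    //     pushl   %ebp
    //     movl    %esp, %ebp
    //     movl    $0, %eax
    //     popl    %ebp
    //     ret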

-O0 code is always full of bottlenecks, usually on load/store or store-forwarding. Unfortunately IACA doesn't account for store-forwarding latency, so it doesn't realize that these loops can actually bottleneck on that latency (see below).


As far as I know, once a new address is calculated, some clock cycles must elapse before the memory can be referenced, so the second loop after the addl $4,-12(%ebp) needs to wait a little before proceeding,

Yes, the mov load of -12(%ebp) won't be ready for about 6 cycles after the load that was part of add's read-modify-write.

whereas in the first loop we can immediately refer to the memory

Yes

and in the meanwhile LEAL will compute the next address

No.

Your analysis is close, but you missed the fact that the next iteration still has to load the value we stored into -12(%ebp). So the loop-carried dependency chain is the same length, and the next iteration's lea can't actually start any sooner than in the loop using add.


The latency issues might not be the loop throughput bottleneck:

uop / execution port throughput needs to be considered. In this case, the OP's testing shows it's actually relevant. (Or latency from resource conflicts.)

When gcc -O0 implements ptr++, it keeps the old value in a register, like you said. So store addresses are known further ahead of time, and there's one fewer load uop that needs an AGU.

Assuming an Intel SnB-family CPU:

## ptr++: 1st loop
movl    -12(%ebp), %eax   //1 uop (load)
leal    4(%eax), %edx     //1 uop (ALU only)
movl    %edx, -12(%ebp)   //1 store-address, 1 store-data
//   no load from -12(%ebp) into %eax
... rest the same.


 ## ++ptr:  2nd loop
addl    $4, -12(%ebp)       // read-modify-write: 2 fused-domain uops.  4 unfused: 1 load + 1 add (ALU) + 1 store-address + 1 store-data
movl    -12(%ebp), %eax     // load: 1 uop.   ~6 cycle latency for %eax to be ready
... rest the same

So the pointer-increment part of the 2nd loop has one more load uop. Probably the code bottlenecks on AGU throughput (address-generation units). IACA says that's the case for arch=SNB, but that HSW bottlenecks on store-data throughput (not AGUs).

However, without taking store-forwarding latency into account, IACA says the first loop can run at one iteration per 3.5 cycles, vs. one per 4 cycles for the second loop. That's faster than the 6 cycle loop-carried dependency of the addl $1, -48(%ebp) loop counter, which indicates that the loop is bottlenecked by latency to less than max AGU throughput. (Resource conflicts probably mean it actually runs slower than one iteration per 6c, see below).
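For reference, loops get marked up for this kind of IACA analysis roughly like so (a sketch, not necessarily how the numbers above were produced: IACA_START / IACA_END come from Intel's iacaMarks.h, and the function name and the -O0 compile are just for illustration):

    #include "iacaMarks.h"   // ships with Intel's IACA; defines the IACA_START / IACA_END markers

    void fill(int *data, int size) {
        int *ptr = data;
        IACA_START;                       // begin the analyzed region
        for (int index = 0; index < size; index++) {
            *(ptr++) = index;             // the ptr++ loop from the question
        }
        IACA_END;                         // end the analyzed region
    }
    // Compile to an object file (e.g. g++ -O0 -m32 -c fill.cpp) and run the iaca tool on it;
    // the exact command-line options depend on the IACA version.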

We could test this theory:

Adding an extra load uop to the lea version, off the critical path, would consume more load-port throughput, but wouldn't be part of the loop's latency chains. e.g.

movl    -12(%ebp), %eax   //Get the current address
leal    4(%eax), %edx     //Calculate the next address
movl    %edx, -12(%ebp)   //Store the new (next) address

mov     -12(%ebp), %edx 

%edx is about to be overwritten by a mov, so there are no dependencies on the result of this load. (The destination of mov is write-only, so it breaks dependency chains, thanks to register renaming.)

So this extra load would bring the lea loop up to the same number and flavour of uops as the add loop, but with different latency. If the extra load has no effect on speed, we know the first loop isn't bottlenecked on load / store throughput.


Update: OP's testing confirmed that an extra unused load slows the lea loop down to about the same speed as the add loop.

Why extra uops matter when we're not hitting execution port throughput bottlenecks

uops are scheduled in oldest-first order (out of uops that have their operands ready), not in critical-path-first order. Extra uops that could have been done in a spare cycle later on will actually delay uops that are on the critical path (e.g. part of the loop-carried dependency). This is called a resource conflict, and can increase the latency of the critical path.

i.e. instead of waiting for a cycle where critical-path latency left a load port with nothing to do, the unused load will run when it's the oldest load with its load-address ready. This will delay other loads.

Similarly, in the add loop where the extra load is part of the critical path, the extra load causes more resource conflicts, delaying operations on the critical path.


Other guesses:

So maybe having the store address ready sooner is what's doing it, so memory operations are pipelined better. (e.g. TLB-miss page walks can start sooner when approaching a page boundary. Even normal hardware prefetching doesn't cross page boundaries, even if they are hot in the TLB. The loop touches 4MiB of memory, which is enough for this kind of thing to matter. L3 latency is high enough to maybe create a pipeline bubble. Or if your L3 is small, then main memory certainly is.)

Or maybe the extra latency just makes it harder for out-of-order execution to do a good job.

  • I updated my answer with a possible experiment to test the uop-throughput bottleneck theory. If someone wants to post results for all three versions of the loop (OP's originals, and `lea` + unused load), that might be interesting, as long as we know what CPU microarchitecture the result is for. – Peter Cordes May 16 '16 at 11:01
  • 1
    I made the experiment, and is exacly as you said! Adding one extra movl (movl -12(%ebp), %edx) just before edx is overwritten by -48(%ebp) cause the performance difference to shrink to a negligible value!The difference in performance is now less than ~2%. I'm running the tests on a Haswell processor (i5-4300u). – fjanisze May 16 '16 at 11:20
  • @fjanisze: Thanks for testing, that result gave me a clue about what's going on: updated my answer again. Even though load-uop throughput isn't the bottleneck, having more of them causes resource conflicts for the loads that are on the critical path. – Peter Cordes May 16 '16 at 11:49