This is all because you used `volatile`, and GCC doesn't optimize it as aggressively.

Without `volatile`, e.g. for a single `++*int_ptr`, you get a memory-destination `add`. (And hopefully not `inc` when tuning for Intel CPUs; `inc reg` is fine but `inc mem` costs an extra uop vs. `add 1`. Unfortunately gcc and clang both get this wrong and use `inc mem` with `-march=skylake`: https://godbolt.org/z/_1Ri20)
clang knows that it can fold the `volatile` read / write accesses into the load and store portions of a memory-destination `add`. GCC does not know how to do this optimization for `volatile`. Using `volatile` in GCC typically results in separate `mov` loads and stores, missing out on x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway, so it would be a non-issue.
TL:DR: different compiler internals around `volatile`, specifically a GCC missed optimization.

This missed optimization barely matters because `volatile` is rarely used. But feel free to report it on GCC's bugzilla if you want.
Without `volatile`, the loop would of course optimize away. But you can see a single memory-destination `add` from GCC or clang for a function that just does `++*p`.
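As a minimal sketch of that (the function names are mine, not from the question):

```c
/* Without volatile, GCC and clang both compile this to a single
 * memory-destination add:   add dword ptr [rdi], 1  /  ret        */
void incr_plain(int *p) { ++*p; }

/* With volatile, clang still folds the read/write into one
 * memory-destination add, but GCC emits separate mov load,
 * add, and mov store instructions.                                */
void incr_volatile(volatile int *p) { ++*p; }
```

Both are semantically an increment; only the instruction selection differs, as the Godbolt link above shows.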
> 1) Is gcc doing something wrong? What is the point of copying the value?
It's only copying it to a register. We wouldn't normally call this "copying", just bringing the value into a register where the CPU can operate on it.
Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing to just `dec`/`jnz` (actually `add -1`, but it would use `dec` with `-march=skylake` or something else with efficient `dec`, i.e. not Silvermont). GCC spends an extra uop on the loop condition (on Intel CPUs, where `add`/`jnz` can macro-fuse into a single uop). IDK why it compiles it naively like that.
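The original loop isn't reproduced here; a hypothetical reconstruction with the shape under discussion (names and iteration count are my assumption) would be:

```c
/* Hypothetical reconstruction of the benchmarked loop: `volatile`
 * forces a store and a reload of `value` on every iteration, so the
 * loop can't be optimized away even at -O2.                         */
int count_volatile(int iters) {
    volatile int value = 0;       /* lives in memory, not a register */
    for (int i = 0; i < iters; i++)
        value += 1;               /* GCC: mov load / add / mov store */
    return value;                 /* clang: add dword ptr [mem], 1   */
}
```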
> 73% of time is wasted on the instruction `add edx, 1`
Perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it. `add edx, 1` is waiting for the reload of `value`. With 4 to 5 cycle store-forwarding latency, this is the major bottleneck in your loop.

(Whether that latency is between the multiple uops of a memory-destination `add` or between separate instructions makes essentially no difference. There are no other memory accesses in your loop, so none of the weird effects of store-forwarding latency being lower if you don't try to reload too soon come into play: *Adding a redundant assignment speeds up code when compiled without optimization* or *Loop with function call faster than an empty loop*.)
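For contrast, a sketch (mine, not from the question) of the same loop without the store-forwarding chain:

```c
/* Without volatile, the counter stays in a register, so there is no
 * store-forward round trip on the loop-carried dependency chain.
 * (At -O2 a compiler will typically fold the whole loop away to a
 * single `mov eax, edi` style return of n.)                         */
int count_register(int n) {
    int value = 0;                /* kept in a register */
    for (int i = 0; i < n; i++)
        value += 1;               /* 1-cycle add, no memory access */
    return value;
}
```

This is why the `volatile` version is latency-bound at ~5 cycles per iteration rather than 1.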
> Why do the other addition and move instructions take less than 1% of the time?
Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.
> 3) Why can performance differ between gcc and clang in such primitive code?
I'd expect both those loops to run at the same speed. Did you just mean performance as in how well the compilers themselves performed in making code that's both fast and compact?