
I wrote a simple loop:

int volatile value = 0;

void loop(int limit) {
  for (int i = 0; i < limit; ++i) { 
      ++value;
  }
}

I compiled this with gcc and clang (`-O3 -fno-unroll-loops`) and got different outputs. They differ in the `++value` part:

clang:

  add dword ptr [rip + value], 1 # ++value
  add edi, -1                    # --limit
  jne .LBB0_1                    # if limit != 0 then keep looping

gcc:

  mov eax, DWORD PTR value[rip] # copy value to a register
  add edx, 1                    # ++i
  add eax, 1                    # increment a copy of value
  mov DWORD PTR value[rip], eax # store incremented copy to value, i. e. ++value
  cmp edi, edx                  # compare i with limit
  jne .L3                       # if i != limit then keep looping

The C and C++ versions are the same on each compiler (https://gcc.godbolt.org/z/5x5jGP). So my questions are:

1) Is gcc doing something wrong? What is the point of copying the value?

2) I have benchmarked that code, and for some reason the profiler shows that in gcc's version 73% of the time is wasted on the instruction `add edx, 1`, 13% on `mov DWORD PTR value[rip], eax`, and 13% on `cmp edi, edx`. Am I interpreting these results wrong? Why do the other addition and move instructions take less than 1% of the time?

3) Why can performance differ between gcc and clang on such primitive code?

SaveMyLife
  • Did you compile as C or C++? They are different languages. – NathanOliver Oct 25 '19 at 18:19
  • What happens if you remove `volatile`? – Barmar Oct 25 '19 at 18:21
  • @NathanOliver What's the difference that impacts this code? – Barmar Oct 25 '19 at 18:21
  • I feel it's a mistake to expect all compilers to produce assembly of equal quality. The output is different because they are different compilers that work differently. Compilers are not obligated to produce the best possible code. They just produce the best code *they can*. – François Andrieux Oct 25 '19 at 18:24
  • @FrançoisAndrieux I do not expect them to produce the same assembly. I do not believe that in the most simple case (**integer addition**) two major compilers could yield code of different quality. There should be a reason for this! – SaveMyLife Oct 25 '19 at 18:28
  • Arguably the clang code is even wrong, since it does not produce individual load and store on the volatile object like the formal semantics would have it do. This may make a difference in the presence of certain memory-mapped registers, though probably not on x86, so it's probably not a real bug. – R.. GitHub STOP HELPING ICE Oct 25 '19 at 18:30
  • @Barmar 5.1.2.3.4 of C11 seems to allow optimizing away volatile but I'm not sure. Is that their version of the as-if rule? – NathanOliver Oct 25 '19 at 18:31
  • My guess: Clang doesn't care that the standard is violated since it doesn't matter on the target system, and GCC doesn't care because they don't care about the performance of `volatile`. – eerorika Oct 25 '19 at 18:32
  • @FrançoisAndrieux That's a bit disingenuous. While nothing forces them to be identical, you'd expect one of the most popular compilers to produce well-optimized code for such a simple, common construct. – Barmar Oct 25 '19 at 18:35
  • @R..: There is a load and a store in that `add` instruction. The C standard does not say `++value` has to result in separate load and store instructions; it says the accesses are evaluated strictly according to the rules of the abstract machine, but what constitutes an access to a volatile object is implementation-defined. Loading it from memory and storing it could satisfy the requirement for a read access and a write access. – Eric Postpischil Oct 25 '19 at 18:37
  • @EricPostpischil: It's implementation-defined, but needs to match what works on the underlying hardware architecture. If you had an architecture that did arithmetic on memory operands via offloading it to the memory controller (IIRC I've heard of such a thing) but memory-mapped hardware registers didn't support such operations, it wouldn't be a viable implementation for use with such hardware. As I said, I'm pretty sure this is not relevant to x86, but it still might be a mistake to have such combining transformations on volatile objects if you can't preclude them from happening on hardware architectures where it wouldn't be valid. – R.. GitHub STOP HELPING ICE Oct 25 '19 at 18:58
  • @uneven_mark: Indeed, but that's for C++. C is clear that the `++` operator happens as if by `x = x + 1`, where there is one abstract load and one abstract store. – R.. GitHub STOP HELPING ICE Oct 25 '19 at 18:59
  • @uneven_mark: That deprecation proposal is at the source language level; optimizing `volatile_store(value, volatile_load(value) + 1)` into a memory-destination add on x86 would still be legal. It's not even an atomic RMW so the store and load still exist separately. (On a single core it's atomic wrt. interrupts and context switches, but from the POV of things outside the core executing it, it's not. I doubt the standard or real implementations care about that level of detail.) So yes, related, but not strictly relevant. – Peter Cordes Oct 25 '19 at 21:29
  • *3) Why can performance differ between gcc and clang on such primitive code?* I don't see any benchmark results in your question. Do you mean runtime performance of the loop? Or do you just mean how well the compiler performed in terms of making optimal code that's both fast and compact? – Peter Cordes Oct 25 '19 at 21:35
  • @NathanOliver-ReinstateMonica "_They are different languages_" How are they different in that simple code? – curiousguy Nov 03 '19 at 22:35
  • @EricPostpischil "_but what constitutes an access to a volatile object is implementation-defined_" What definition of an access would even fit one asm output and not the other? AFAIK "constitute an access" has nothing to do w/ how many asm instr are generated by a given C or C++ statement. – curiousguy Nov 03 '19 at 23:11
  • @FrançoisAndrieux "_I feel it's a mistake to expect all compilers to produce assembly of equal quality_" And I feel it's a mistake to see these outputs as having drastically different quality. I have seen really garbage code being produced, but that seems fine to me. – curiousguy Nov 13 '19 at 03:12

1 Answer


This is all because you used `volatile`, and GCC doesn't optimize it as aggressively.

Without `volatile`, e.g. for a single `++*int_ptr`, you get a memory-destination add. (And hopefully not `inc` when tuning for Intel CPUs; `inc reg` is fine, but `inc mem` costs an extra uop vs. `add 1`. Unfortunately gcc and clang both get this wrong and use `inc mem` with `-march=skylake`: https://godbolt.org/z/_1Ri20)
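
For example, a minimal non-volatile version (my own sketch; the asm in the comment is what I'd expect from gcc/clang at `-O3` on x86-64, not pasted compiler output):

void inc(int *p) {
    ++*p;       // expect a single RMW:  add dword ptr [rdi], 1  then  ret
}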


clang knows that it can fold the volatile read / write accesses into the load and store portions of a memory-destination add.

GCC does not know how to do this optimization for `volatile`. Using `volatile` in GCC typically results in separate `mov` loads and stores, avoiding x86's ability to save code-size by using CISC memory operands for ALU instructions. On a load/store machine (like any RISC) you'd need separate load and store instructions anyway, so it would be a non-issue.
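
In source terms, GCC's output for `++value` corresponds to writing the volatile access out as an explicit load and store (a sketch of what the asm maps back to, not a transformation GCC performs on your source):

int volatile value;

void bump(void) {
    int tmp = value;    // mov eax, DWORD PTR value[rip]   (the volatile read)
    value = tmp + 1;    // add eax, 1
                        // mov DWORD PTR value[rip], eax   (the volatile write)
}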

TL:DR: different compiler internals around `volatile`, specifically a GCC missed optimization.

This missed optimization barely matters, because `volatile` is rarely used. But feel free to report it on GCC's bugzilla if you want.

Without `volatile`, the loop would of course optimize away. But you can see a single memory-destination add from GCC or clang for a function that just does `++*p`, like the sketch above.

1) Is gcc doing something wrong? What is the point of copying the value?

It's only copying it to a register. We don't normally call this "copying"; it's just loading it into a register where the CPU can operate on it.


Note that gcc and clang also differ in how they implement the loop condition, with clang optimizing down to just `dec`/`jnz` (actually `add -1`, but it would use `dec` with `-march=skylake` or something else with efficient `dec`, i.e. not Silvermont).

GCC spends an extra uop on the loop condition by counting `i` up and comparing against `limit`, instead of counting down to zero (on Intel CPUs, `cmp`/`jne` and `add`/`jnz` can each macro-fuse into a single uop, so the separate counter increment is the extra one). IDK why it compiles it naively like that.
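
What clang does is roughly the following source-level transformation (a hand-written sketch; the compiler only does this internally, and only because `i` isn't used for anything except counting trips):

extern int volatile value;   // the global from the question

void loop(int limit) {
  // Count down to zero instead of up to limit: the decrement itself
  // sets the flags the branch tests, so no separate cmp is needed.
  for (int i = limit; i > 0; --i) {
      ++value;
  }
}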

73% of the time is wasted on the instruction `add edx, 1`

perf counters typically blame the instruction that's waiting for a slow result, not the instruction that's actually slow to produce it.

`add edx, 1` itself is trivially cheap; it just happens to be the instruction right after the `mov` reload of `value`, which is what's actually stalled waiting for the previous iteration's store. Sampling skid tends to charge those stall cycles to a nearby later instruction. With 4 to 5 cycle store-forwarding latency, the store/reload of `value` is the major bottleneck in your loop.

(Whether it's between the multiple uops of a memory-destination add or between separate instructions makes essentially no difference. There are no other memory accesses in your loop, so none of the weird effects of store-forwarding latency being *lower* if you don't try too soon come into play; see *Adding a redundant assignment speeds up code when compiled without optimization* and *Loop with function call faster than an empty loop*.)
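
To see that it's the loop-carried dependency through memory that matters, not any single instruction, compare these two (a sketch; the per-iteration cycle counts are rough expectations for recent Intel CPUs, not measurements):

int volatile value;

// Chain through memory: each iteration's reload of value has to wait
// for the previous iteration's store. Roughly 5 cycles per iteration.
void bump_memory(int n) {
    for (int i = 0; i < n; ++i)
        ++value;
}

// Same arithmetic with the chain kept in a register: roughly 1 cycle
// per iteration. (With a plain non-volatile pointer the compiler may
// fold the whole loop into a single add anyway; this is only here for
// the latency comparison.)
void bump_register(int *p, int n) {
    int tmp = *p;
    for (int i = 0; i < n; ++i)
        ++tmp;
    *p = tmp;
}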

Why do the other addition and move instructions take less than 1% of the time?

Because out-of-order execution hides them under the latency of the critical path. They are very rarely the instruction that gets blamed when statistical sampling has to pick one out of the many that are in flight at once in any given cycle.

3) Why can performance differ between gcc and clang on such primitive code?

I'd expect both of those loops to run at the same speed. Did you just mean performance as in how well the compilers themselves performed at making code that's both fast and compact?
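
If you want actual numbers, something like this minimal harness would do (a sketch assuming POSIX `clock_gettime`; `loop` is your function from the question):

#include <stdio.h>
#include <time.h>

int volatile value = 0;

void loop(int limit) {
  for (int i = 0; i < limit; ++i) {
      ++value;
  }
}

int main(void) {
    enum { N = 100000000 };   // 1e8 iterations
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    loop(N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per ++value\n", ns / N);
    return 0;
}

With the ~5-cycle store/reload chain dominating, I'd expect the gcc and clang builds of this to print nearly identical numbers.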

Peter Cordes
  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50677 and several others. IIRC there may be a couple gcc targets that perform this optimization, but not x86. – Marc Glisse Oct 26 '19 at 20:28
  • Re: performance counters blaming the instruction waiting for the slow result: [Unexpectedly slow performance on moving from register to frequently accessed variable](https://stackoverflow.com/a/76774051) is an attempt at a simple canonical answer for that. – Peter Cordes Jul 26 '23 at 18:43