
I wrote this simple C++ code to see how atomic variables are implemented.

#include <atomic>

using namespace std;

atomic<float> f(0);

int main() {
    f += 1.0;
}

The compiler generates this assembly for main at -O3:

main:
        mov     eax, DWORD PTR f[rip]
        movss   xmm1, DWORD PTR .LC1[rip]
        movd    xmm0, eax
        mov     DWORD PTR [rsp-4], eax  ; this line is redundant
        addss   xmm0, xmm1
.L2:
        mov     eax, DWORD PTR [rsp-4]  ; this line is redundant
        movd    edx, xmm0
        lock cmpxchg    DWORD PTR f[rip], edx
        je      .L5
        mov     DWORD PTR [rsp-4], eax  ; this line is redundant
        movss   xmm0, DWORD PTR [rsp-4]  ; this line can become  movd    xmm0, eax
        addss   xmm0, xmm1
        jmp     .L2
.L5:
        xor     eax, eax
        ret
f:
        .zero   4
.LC1:
        .long   1065353216

It uses the atomic compare-and-exchange technique to achieve atomicity. But there, the old value is being stored on the stack at [rsp-4]. But in the above code, eax is invariant. So the old value is preserved in eax itself. Why is the compiler allocating additional space for the old value, even at -O3? Is there any specific reason to store that variable in the stack rather than in registers?

EDIT: Logical deduction -

There are 4 lines that use rsp-4 -

mov     DWORD PTR [rsp-4], eax    --- 1
mov     eax, DWORD PTR [rsp-4]    --- 2  <--.
mov     DWORD PTR [rsp-4], eax    --- 3     | loop
movss   xmm0, DWORD PTR [rsp-4]   --- 4  ---'

Lines 3 and 4 have absolutely nothing else in between, and hence 4 can be rewritten using 3 as
movd xmm0, eax.

Now, when going from line 3 to 2 in the loop, there is no modification to rsp-4 (nor to eax). So lines 3 and 2 in sequence collapse to
mov eax, eax
which is redundant by nature.

Finally, only line 1 remains, whose destination is never used again. So it is also redundant.
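The deduction above relies on the store/reload through [rsp-4] being nothing more than a 32-bit copy, which is also all that movd does between a general-purpose register and an SSE register. A minimal sketch of that idea in C++, assuming a 32-bit IEEE-754 float (true on x86-64); the helper names `float_to_bits` and `bits_to_float` are made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the 32 bits of a float into an integer,
// unchanged. This mirrors `movd eax, xmm0`.
std::uint32_t float_to_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Hypothetical helper: copy 32 integer bits into a float, unchanged.
// This mirrors `movd xmm0, eax`.
float bits_to_float(std::uint32_t bits) {
    float x;
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

Under that assumption, `float_to_bits(1.0f)` gives 1065353216, which is the `.long` stored at .LC1 in the listing.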

Boann
Sourav Kannantha B
  • @Yksisarvinen yep, it is pretty straightforward. After seeing that, I moved onto floats ;) – Sourav Kannantha B Jul 06 '21 at 15:02
  • @SouravKannanthaB They were responding to a slightly misleading comment of mine wondering if this was related to float not having an atomic addition. –  Jul 06 '21 at 15:04
  • clang's code is more sensible: https://godbolt.org/z/qeGK8KzW4. This looks like a simple missed optimization bug, you could report it. – Nate Eldredge Jul 14 '21 at 18:24

1 Answer


Is there any specific reason to store that variable in stack rather than in registers?

At the end of the day, atomics exist for inter-thread communication, and you can't share a register across threads.

You might think that gcc could detect local variable atomics that are never shared with anything else and demote them to a regular variable. However:

  1. I personally don't see what this brings to the table since you shouldn't be using atomics in these cases.
  2. The standard appears to prohibit such an optimization anyways:

intro.races-14

The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.

The key word here is side effect, which means that the modification of the actual memory storage is not up for debate. It HAS to happen.

As far as the revised question goes:

But in the above code, eax is invariant

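It's unfortunately not. cmpxchg both reads and writes eax: it reads eax as the expected value and, when the comparison fails, overwrites it with the value currently in memory. So it needs to be reassigned at each iteration of the loop.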

The loop is needed because, in order to perform a += 1 on an atomic float, the compiler has to keep retrying the read-increment-write sequence until it completes without the atomic having been changed in the meantime.
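The retry loop can be written out by hand to make the roles of the registers easier to follow. This is only a sketch of the general compare-exchange pattern, not necessarily what the standard library does internally; the function name `atomic_add` is hypothetical:

```cpp
#include <atomic>

// Hypothetical helper: add `delta` to `a` with an explicit
// compare-exchange loop and return the value that was stored.
float atomic_add(std::atomic<float>& a, float delta) {
    float expected = a.load();         // corresponds to the value in eax
    float desired = expected + delta;  // corresponds to the value in xmm0
    // On failure, compare_exchange_weak writes the value it found in `a`
    // into `expected`, so `desired` must be recomputed before retrying.
    while (!a.compare_exchange_weak(expected, desired)) {
        desired = expected + delta;
    }
    return desired;
}
```

With `std::atomic<float> f(0);`, calling `atomic_add(f, 1.0f)` leaves 1.0f in `f`. Note that `expected` is an ordinary local here, so nothing in the source forces it to live in memory rather than in a register.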

  • `you can't share a register across threads` Intermediate results need not be shared with other threads. Anyway, the stack is not shared with other threads. (I have edited my question to mark exactly the lines which I see as redundant. Please check it.) – Sourav Kannantha B Jul 06 '21 at 15:13
  • @SouravKannanthaB sharing a pointer to a stack allocated object to some other thread is perfectly feasible. As far as intermediate results go, I'll see if I can update the answer to reflect the revised question. –  Jul 06 '21 at 15:17
  • @SouravKannanthaB I have revised the answer –  Jul 06 '21 at 15:46
  • Still, `[rsp - 4]` is redundant. Although `eax` is modified in the loop, that is actually the required modification. Remove the redundant lines as mentioned and analyze. You will get the exact same data flow. – Sourav Kannantha B Jul 06 '21 at 16:12
  • @SouravKannanthaB: That's correct; GCC's "recipe" for using integer `cmpxchg` on `float` data apparently misleads it into doing redundant stores instead of just `movd`. Does it help to use `-march=haswell` or something, instead of the default generic tuning? See also [Atomic double floating point or SSE/AVX vector load/store on x86\_64](https://stackoverflow.com/q/45055402) for some code-gen for pure-load and pure-store (even more dumb since it insists on using integer loads/stores instead of atomic SSE2 pure loads.) My answer there also comments on the useless store/reload in an RMW. – Peter Cordes Jul 06 '21 at 17:04
  • This answer seems to be answering a different question, about the existence of the whole loop, not just the stores/loads instead of `movd`. – Peter Cordes Jul 06 '21 at 17:04