
I don't understand how `std::memory_order_XXX` (e.g. `memory_order_release`, `memory_order_acquire`, ...) works.

From the documents I've read, these memory orders have different semantics, but I'm confused because they produce the same assembly code. What determines the difference?

This code:

    static std::atomic<long> gt;

    void test1() {
        gt.store(1, std::memory_order_release);
        gt.store(2, std::memory_order_relaxed);
        gt.load(std::memory_order_acquire);
        gt.load(std::memory_order_relaxed);
    }

Corresponds to:

        00000000000007a0 <_Z5test1v>:
         7a0:   55                      push   %rbp
         7a1:   48 89 e5                mov    %rsp,%rbp
         7a4:   48 83 ec 30             sub    $0x30,%rsp

        # memory_order_release:
         7a8:   48 c7 45 f8 01 00 00    movq   $0x1,-0x8(%rbp)
         7af:   00 
         7b0:   c7 45 e8 03 00 00 00    movl   $0x3,-0x18(%rbp)
         7b7:   8b 45 e8                mov    -0x18(%rbp),%eax
         7ba:   be ff ff 00 00          mov    $0xffff,%esi
         7bf:   89 c7                   mov    %eax,%edi
         7c1:   e8 b1 00 00 00          callq  877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
         7c6:   89 45 ec                mov    %eax,-0x14(%rbp)
         7c9:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
         7cd:   48 8d 05 44 08 20 00    lea    0x200844(%rip),%rax        # 201018 <_ZL2gt>
         7d4:   48 89 10                mov    %rdx,(%rax)
         7d7:   0f ae f0                mfence

        # memory_order_relaxed:
         7da:   48 c7 45 f0 02 00 00    movq   $0x2,-0x10(%rbp)
         7e1:   00 
         7e2:   c7 45 e0 00 00 00 00    movl   $0x0,-0x20(%rbp)
         7e9:   8b 45 e0                mov    -0x20(%rbp),%eax
         7ec:   be ff ff 00 00          mov    $0xffff,%esi
         7f1:   89 c7                   mov    %eax,%edi
         7f3:   e8 7f 00 00 00          callq  877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
         7f8:   89 45 e4                mov    %eax,-0x1c(%rbp)
         7fb:   48 8b 55 f0             mov    -0x10(%rbp),%rdx
         7ff:   48 8d 05 12 08 20 00    lea    0x200812(%rip),%rax        # 201018 <_ZL2gt>
         806:   48 89 10                mov    %rdx,(%rax)
         809:   0f ae f0                mfence

        # memory_order_acquire:
         80c:   c7 45 d8 02 00 00 00    movl   $0x2,-0x28(%rbp)
         813:   8b 45 d8                mov    -0x28(%rbp),%eax
         816:   be ff ff 00 00          mov    $0xffff,%esi
         81b:   89 c7                   mov    %eax,%edi
         81d:   e8 55 00 00 00          callq  877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
         822:   89 45 dc                mov    %eax,-0x24(%rbp)
         825:   48 8d 05 ec 07 20 00    lea    0x2007ec(%rip),%rax        # 201018 <_ZL2gt>
         82c:   48 8b 00                mov    (%rax),%rax

        # memory_order_relaxed:
         82f:   c7 45 d0 00 00 00 00    movl   $0x0,-0x30(%rbp)
         836:   8b 45 d0                mov    -0x30(%rbp),%eax
         839:   be ff ff 00 00          mov    $0xffff,%esi
         83e:   89 c7                   mov    %eax,%edi
         840:   e8 32 00 00 00          callq  877 <_ZStanSt12memory_orderSt23__memory_order_modifier>
         845:   89 45 d4                mov    %eax,-0x2c(%rbp)
         848:   48 8d 05 c9 07 20 00    lea    0x2007c9(%rip),%rax        # 201018 <_ZL2gt>
         84f:   48 8b 00                mov    (%rax),%rax

         852:   90                      nop
         853:   c9                      leaveq 
         854:   c3                      retq   

        00000000000008cc <_ZStanSt12memory_orderSt23__memory_order_modifier>:
         8cc:   55                      push   %rbp
         8cd:   48 89 e5                mov    %rsp,%rbp
         8d0:   89 7d fc                mov    %edi,-0x4(%rbp)
         8d3:   89 75 f8                mov    %esi,-0x8(%rbp)
         8d6:   8b 55 fc                mov    -0x4(%rbp),%edx
         8d9:   8b 45 f8                mov    -0x8(%rbp),%eax
         8dc:   21 d0                   and    %edx,%eax
         8de:   5d                      pop    %rbp
         8df:   c3                      retq   

I expected different memory orders to have different implementations in the assembly code, but choosing a different order has no effect on the assembly. Can anyone explain this?

curiousguy
taoozh
    Can you please post unmangled asm? – curiousguy Jun 15 '19 at 02:39
  • `_ZStanSt12memory_orderSt23__memory_order_modifier` means `std::operator&(std::memory_order, std::__memory_order_modifier)` – curiousguy Jun 15 '19 at 02:48
  • [Interactive version](https://godbolt.org/z/zcueGw); numbers changed to make searching asm code easier. – curiousguy Jun 15 '19 at 02:54
  • The asm code you posted is essentially useless moving values back and forth, unoptimized GCC garbage. At decent optimization level there is one asm instr per C++ line. – curiousguy Jun 15 '19 at 03:01
  • @Taoozh: "*I expect different memory mode has different implements on assemble code*" You shouldn't. How an implementation implements the standard is up to the implementation. You should not try to learn C++ by reading the assembly from some piece of C++ and trying to divine what the implementation was trying to accomplish. – Nicol Bolas Jun 15 '19 at 03:02

2 Answers


Each memory order has its own semantics, and the compiler is obliged to satisfy them, meaning that:

  1. It disallows the compiler from performing certain optimizations, such as reordering reads and writes.

  2. It instructs the compiler to propagate the same constraint down to the hardware. How that is done depends on the platform. x86_64 itself provides a very strong memory model, so in almost all cases you will see no difference in the generated assembly on x86_64 no matter which memory order you choose. However, on weakly ordered architectures such as ARM you will see a difference, because the compiler has to insert memory barriers. The type of barrier depends on the selected memory order.

EDIT: Have a look at JSR-133 (the Java memory model specification). It is very old and is about Java, but it provides the best explanation of a memory model from the compiler's perspective that I know of. In particular, look at the table of memory-barrier instructions for different architectures.

gudok
  • Careful w/ Java/C++ comparisons: Java volatile has stronger semantics than acquire-release C++ atomic ops. Also there is no C++ equivalent for the semantics of regular fields in Java objects. – curiousguy Jun 14 '19 at 09:17
  • Almost all: specifically **`mo_seq_cst` pure-stores will be different**. (`xchg` or `mov` + `mfence` for seq-cst vs. plain `mov` for anything weaker). Compile-time reordering of *other* nearby non-atomic operations can also vary, regardless of atomic RMW on x86 always needing to use a `lock` prefix (full barrier). e.g. `a=1; atomic_op; a=2;` can optimize to just `a=2; atomic_op` if it's relaxed. – Peter Cordes Jun 23 '19 at 00:20
  • @PeterCordes: Is this true for 4-byte ints on x86? When optimized, `void f(std::atomic<int>& i) { i.load(); ... }` just does an indirect mov from the address in rdi. – Andrew Jun 10 '22 at 14:19
  • @Andrew: Yes, my comment was specifically about x86 (including x86-64). Your test confirms that `seq_cst` loads are *not* different from non-atomic loads, like I said. Only `i.store(1)` would compile to more instructions than `i.store(1, mo_release)`. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html . See also [How do memory\_order\_seq\_cst and memory\_order\_acq\_rel differ?](https://stackoverflow.com/a/58043923) and [C++ How is release-and-acquire achieved on x86 only using MOV?](https://stackoverflow.com/q/60314179). Also – Peter Cordes Jun 10 '22 at 17:49

Given the code:

    #include <atomic>

    static std::atomic<long> gt;

    void test1() {
        gt.store(41, std::memory_order_release);
        gt.store(42, std::memory_order_relaxed);
        gt.load(std::memory_order_acquire);
        gt.load(std::memory_order_relaxed);
    }

At a decent optimization level there is no garbage assembly moving values back and forth between registers and the stack:

    test1():
            movq    $41, gt(%rip)
            movq    $42, gt(%rip)
            movq    gt(%rip), %rax
            movq    gt(%rip), %rax
            ret

We see that exactly the same code is generated for the different memory orders. (Testing different statements in sequence in the same function is bad practice, because C++ statements don't have to be compiled independently and context might influence code generation; with GCC's current code generation, however, each statement involving an atomic is compiled on its own, so the comparison happens to work here. Good practice is to put each statement in its own function.)

The same code is generated here because, on x86-64, no special instruction happens to be needed for these particular memory orders.
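
Following the one-statement-per-function advice above, a sketch like this (my own, not from the question) makes the per-order code generation easy to inspect in isolation, e.g. on godbolt.org with `-O2`. On x86-64, only the seq_cst store needs anything special (`xchg`, or `mov` plus `mfence`); the other three compile to a plain `mov`:

```cpp
// One atomic statement per function, so each memory order's code
// generation can be inspected separately. Comments describe typical
// x86-64 output; weakly ordered targets such as ARM differ.
#include <atomic>

std::atomic<long> gt{0};

void store_release(long v) { gt.store(v, std::memory_order_release); }  // plain mov
void store_seq_cst(long v) { gt.store(v, std::memory_order_seq_cst); }  // xchg (or mov+mfence)
long load_acquire() { return gt.load(std::memory_order_acquire); }      // plain mov
long load_relaxed() { return gt.load(std::memory_order_relaxed); }      // plain mov
```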

curiousguy
  • Oh weird, the relaxed load isn't optimized away even though the value is unused. Compilers are *really* simplistic/cautious about not optimizing atomics! – Peter Cordes Jun 23 '19 at 00:22