
As Anthony Williams said:

> some_atomic.load(std::memory_order_acquire) does just drop through to a simple load instruction, and some_atomic.store(std::memory_order_release) drops through to a simple store instruction.

It is known that on x86, the load() and store() operations with the orderings memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel do not require any extra processor instructions (memory barriers).
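As an illustration, here is a minimal sketch (names invented for this example) of loads and stores where, on x86, the stronger orderings typically compile to exactly the same plain `mov` instructions as the relaxed versions; the ordering argument only constrains compiler reordering:

```cpp
#include <atomic>

std::atomic<int> shared{0};

// On x86 both of these typically compile to the same plain load; the
// acquire ordering costs no extra instruction, it only forbids the
// compiler from moving later memory accesses before the load.
int load_relaxed() { return shared.load(std::memory_order_relaxed); }
int load_acquire() { return shared.load(std::memory_order_acquire); }

// Likewise, a release store is a plain `mov` on x86; it forbids the
// compiler from moving earlier memory accesses after the store.
void store_release(int v) { shared.store(v, std::memory_order_release); }
```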

But we know that on ARMv8 there are memory barriers both for load() and for store(): http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2

About different CPU architectures: http://g.oswego.edu/dl/jmm/cookbook.html

However, for a CAS operation on x86, these two lines with different memory orderings compile to identical disassembly (MSVS2012 x86_64):

    a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
000000013FE71A2D  mov         ebx,dword ptr [temp]  
000000013FE71A31  mov         eax,ebx  
000000013FE71A33  mov         ecx,4  
000000013FE71A38  lock cmpxchg dword ptr [temp],ecx  

    a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);
000000013FE71A4D  mov         ecx,5  
000000013FE71A52  mov         eax,ebx  
000000013FE71A54  lock cmpxchg dword ptr [temp],ecx  

Disassembly code compiled by GCC 4.8.1 x86_64 - GDB:

a.compare_exchange_weak(temp, 4, std::memory_order_seq_cst, std::memory_order_seq_cst);
a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);

0x4613b7  <+0x0027>         mov    0x2c(%rsp),%eax
0x4613bb  <+0x002b>         mov    $0x4,%edx
0x4613c0  <+0x0030>         lock cmpxchg %edx,0x20(%rsp)
0x4613c6  <+0x0036>         mov    %eax,0x2c(%rsp)
0x4613ca  <+0x003a>         lock cmpxchg %edx,0x20(%rsp)

On x86/x86_64 platforms, is any atomic CAS operation, such as atomic_val.compare_exchange_weak(temp, 1, std::memory_order_relaxed, std::memory_order_relaxed);, always executed with sequentially consistent ordering (std::memory_order_seq_cst)?

And if any CAS operation on x86 always runs with sequential consistency (std::memory_order_seq_cst) regardless of the requested ordering, is the same true on ARMv8?

QUESTION: Does a CAS with the ordering std::memory_order_relaxed still lock the memory bus on x86 or ARM?

ANSWER: On x86, compare_exchange_weak() with any std::memory_order (even std::memory_order_relaxed) always translates to LOCK CMPXCHG, which locks the bus to be truly atomic, and it is as expensive as XCHG - "the cmpxchg is just as expensive as the xchg instruction".

(An addition: XCHG with a memory operand is implicitly locked, i.e. XCHG is equal to LOCK XCHG, but CMPXCHG is not equal to LOCK CMPXCHG, and only the latter is truly atomic.)

On ARM and PowerPC, compare_exchange_weak() with different std::memory_orders compiles to different processor instruction sequences, via LL/SC (e.g. LDREX/STREX on ARM).

Processor memory-barrier instructions for x86 (except CAS), ARM, and PowerPC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
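To make the answer concrete, here is a hypothetical sketch (the function name is invented) of an atomic read-modify-write built from compare_exchange_weak with relaxed ordering. On x86 each attempt still compiles to `lock cmpxchg` (the LOCK prefix is what makes it atomic); on ARM it becomes an LDREX/STREX loop. The relaxed ordering only removes ordering guarantees with respect to other memory operations:

```cpp
#include <atomic>

// Sketch: an atomic "fetch and multiply" built from a CAS retry loop.
// Returns the value held before the multiplication.
int fetch_multiply(std::atomic<int>& a, int factor) {
    int expected = a.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (e.g. if the STREX fails
    // on ARM), so it is always used in a retry loop. On failure,
    // `expected` is reloaded with the current value of `a`.
    while (!a.compare_exchange_weak(expected, expected * factor,
                                    std::memory_order_relaxed,
                                    std::memory_order_relaxed)) {
        // retry with the updated `expected`
    }
    return expected;
}
```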

Alex
  • Did you not post this question 15 mn ago ? – dzada Sep 02 '13 at 16:42
  • @dzada That was about MSVS2012 bug: http://stackoverflow.com/questions/18576986/does-the-semantics-of-stdmemory-order-acquire-requires-processor-instruction But it is may be too :) – Alex Sep 02 '13 at 16:57
  • I suspect this is part of the same bug as in the other question. You can't rely on a buggy compiler to produce better code for a near same operation. Compare with GCC. – Mats Petersson Sep 02 '13 at 16:58
  • The first two paragraphs of this question are contradicted by the x86 code you show. There is a difference between a multi-CPU system and a single CPU. The **CAS** is a single atomic operation used for [tag:lock-free] programming. What is your question exactly? You know that **compare and exchange** is different than **load** and **store**. The last two are mainly concerned with a memory barrier for the single operation. **CAS** is both a **load**, a **store** and a **compare**. Are you asking why `cmpxchg` needs a lock? Why is this tagged **ARM**? – artless noise Sep 02 '13 at 19:52
  • @artless noise You don't agree with the quote by Anthony Williams, or don't agree with the asm code produced by MSVS2012? And my question is: what must the compiler produce for this line `a.compare_exchange_weak(temp, 5, std::memory_order_relaxed, std::memory_order_relaxed);` on x86 and on ARMv8? – Alex Sep 02 '13 at 20:04
  • No, neither. Anthony Williams is probably correct. He talks about `load` and `store`. **CAS** is more than `load` **or** `store`. It is both with a compare. The **ARM** must use `ldrex` and `strex`. There are many SO questions with the **ARM** information. – artless noise Sep 02 '13 at 20:07
  • @artless noise Does that **CAS** on **x86** and **ARM** always produce a `lock` for any `std::memory_order` even for `std::memory_order_relaxed`, and `std::memory_order` to CAS only affect reordering by compiler? I do not believe the compiler MSVS2012 because there are bugs. – Alex Sep 02 '13 at 20:15

3 Answers


You shouldn't worry about what instructions the compiler maps a given C11 construct to, as this doesn't capture everything. Instead, you need to develop code with respect to the guarantees of the C11 memory model. As the above comment notes, your compiler or future compilers are free to reorder relaxed memory operations as long as this doesn't violate the C11 memory model. It is also worthwhile to run your code through a tool like CDSChecker to see what behaviors are allowed under the memory model.
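A hypothetical sketch of why the memory model, not the x86 instruction mapping, is what you must program against (all names here are invented). Publishing data with relaxed ordering compiles to plain stores on x86, but the compiler may legally reorder them, so a consumer on another thread could see the flag set before the payload is written:

```cpp
#include <atomic>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};

// Broken: the compiler may reorder these two relaxed stores, even
// though on x86 each one is just a plain `mov`.
void publish_broken() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_relaxed);
}

// Correct: the release store forbids moving the payload store after it.
void publish_correct() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(true, std::memory_order_release);
}

// Pairs with publish_correct(): once the acquire load sees `true`,
// the payload is guaranteed to be visible.
int consume() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload.load(std::memory_order_relaxed);
}
```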

briand
  • I would not worry if it does not affect the performance in 10-100 times and if there would be not a compiler bug :) http://connect.microsoft.com/VisualStudio/feedback/details/770885 Should the order of `std::memory_order_relaxed` for `CAS` block memory bus on x86 or ARM? – Alex Sep 03 '13 at 09:25
  • In that case, you can look at the following: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. – briand Sep 03 '13 at 10:29
  • Thank you very much, this is what you need! Except for the fact that, unfortunately, in the table for x86 there isn't `Cmpxchg Relaxed / Cmpxchg Acquire`. As I understand it in x86 always compiles to the `LOCK CMPXCHG`? – Alex Sep 03 '13 at 10:56
  • I would think so... x86 guarantees load-load ordering, store-store ordering, and load-store ordering... So one would expect that a RMW operation wouldn't be reordered with anything. – briand Sep 03 '13 at 16:53
  • @Alex and briand: I think the compiler emits `lock cmpxchg` because that's the only way to make sure the compare+exchange itself is actually atomic. If there was a way to do it that wasn't also a full memory barrier, `memory_order_relaxed` would produce it. `memory_order_relaxed` does still let the compiler hoist stores of *other* variables out of loops, and otherwise reorder memory access at compile time. – Peter Cordes Sep 04 '15 at 07:23
  • @Alex and briand: turned my comment into an answer. summary: The operation still has to be atomic, and `lock cmpxchg` is the best option. `memory_order_relaxed` should still allow compile-time reordering of memory accesses, including hoisting them out of loops. – Peter Cordes Sep 04 '15 at 08:03

x86 guarantees that loads following loads are ordered, and stores following stores are ordered. Given that CAS requires both loading and storing, all operations have to be ordered around it.

However, it is worth noting that, in the presence of multiple atomics with memory_order_relaxed, the compiler is allowed to reorder them. It cannot do so with memory_order_seq_cst.
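A minimal sketch of that last point (names invented): with relaxed ordering the compiler may legally swap the two stores below, whereas with seq_cst it must keep them in program order (and on x86 a seq_cst store typically compiles to `xchg` or `mov` + `mfence`):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

// The compiler is allowed to reorder these two relaxed stores.
void relaxed_pair() {
    x.store(1, std::memory_order_relaxed);
    y.store(2, std::memory_order_relaxed);
}

// With the default std::memory_order_seq_cst, the stores must stay
// in program order and form part of a single total order.
void seq_cst_pair() {
    x.store(1);
    y.store(2);
}
```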

Cort Ammon

I think the compiler emits lock cmpxchg even for memory_order_relaxed because that's the only way to make sure the compare+exchange itself is actually atomic. As artless noise said in the comments, other architectures can use Load-Linked / Store-Conditional to implement compare_exchange_weak(...).

memory_order_relaxed should still let the compiler hoist stores of other variables out of loops, and otherwise reorder memory access at compile time.

If there was a way to do it on x86 that wasn't also a full memory barrier, a good compiler would use it for memory_order_relaxed.
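As a hypothetical sketch of the compile-time-reordering point (the function and variable names are invented): even though the relaxed atomic operation below still compiles to a locked instruction on x86, the plain store to another variable inside the loop may legally be sunk out of the loop by the compiler:

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Sums `n` ints while counting iterations with a relaxed atomic RMW.
// The relaxed fetch_add does not act as a compiler barrier, so the
// plain store to *scratch may be moved (sunk) out of the loop; only
// the atomicity of each fetch_add itself is guaranteed.
int sum_and_count(const int* data, int n, int* scratch) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
        *scratch = sum;  // a plain store the compiler may hoist/sink
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    return sum;
}
```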

Peter Cordes