From the C++0x proposal on C++ Atomic Types and Operations:
29.1 Order and Consistency [atomics.order]
Add a new sub-clause with the following paragraphs.
The enumeration
memory_order
specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.
memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.
Lower in the proposal:
bool A::compare_swap( C& expected, C desired, memory_order success, memory_order failure ) volatile
where one can specify memory order for the CAS.
My understanding is that “memory_order_acq_rel
” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).
Now, my question is - if I choose “memory_order_acq_rel
” and apply compare_swap
to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?
In particular (assuming a concrete compiler, say gcc):
- How to compare-and-swap an integer location with the above operation?
- What instruction sequence will such a code produce?
- Is the operation lock-free on i7?
- Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
- Related to previous question - is there any performance advantage to using
acq_rel
semantics on i7? What about the other architectures?
Thanks for all the answers.