17

From the C++0x proposal on C++ Atomic Types and Operations:

29.1 Order and Consistency [atomics.order]

Add a new sub-clause with the following paragraphs.

The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in [the new section added by N2334 or its adopted successor] and may provide for operation ordering. Its enumerated values and their meanings are as follows.

  • memory_order_relaxed

The operation does not order memory.

  • memory_order_release

Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.

  • memory_order_acquire

Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.

  • memory_order_acq_rel

The operation has both acquire and release semantics.

  • memory_order_seq_cst

The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.

Lower in the proposal:

bool A::compare_swap( C& expected, C desired,
        memory_order success, memory_order failure ) volatile

where one can specify memory order for the CAS.
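To check my reading of the acquire/release wording above, here is the publication pattern as I picture it, written with the names that eventually shipped in C++11 (my own sketch, not part of the proposal):

#include <atomic>

int payload = 0;                     // ordinary, non-atomic data
std::atomic<bool> ready(false);      // atomic flag used to publish it

void producer() {
    payload = 42;                                   // regular write
    ready.store(true, std::memory_order_release);   // release: publishes the write above
}

void consumer() {
    while (!ready.load(std::memory_order_acquire))  // acquire: pairs with the release store
        ;                                           // spin until the flag is set
    // payload is now guaranteed to be 42 here
}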


My understanding is that “memory_order_acq_rel” will only necessarily synchronize those memory locations which are needed for the operation, while other memory locations may remain unsynchronized (it will not behave as a memory fence).

Now, my question is - if I choose “memory_order_acq_rel” and apply compare_swap to integral types, for instance, integers, how is this typically translated into machine code on modern consumer processors such as a multicore Intel i7? What about the other commonly used architectures (x64, SPARC, ppc, arm)?

In particular (assuming a concrete compiler, say gcc):

  1. How does one compare-and-swap an integer location with the above operation? (A sketch of what I mean follows this list.)
  2. What instruction sequence will such a code produce?
  3. Is the operation lock-free on i7?
  4. Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7? Or will it just synchronize the memory locations needed by this operation?
  5. Related to the previous question - is there any performance advantage to using acq_rel semantics on i7? What about the other architectures?
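Concretely, here is roughly what I would write for question 1, using the compare_exchange_strong name that ended up in C++11 (my own sketch, just to make the question concrete):

#include <atomic>

std::atomic<int> value(0);

bool try_update(int expected, int desired) {
    // CAS on an integer location, specifying both orderings explicitly
    return value.compare_exchange_strong(expected, desired,
                                         std::memory_order_acq_rel,    // on success
                                         std::memory_order_acquire);   // on failure
}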

Thanks for all the answers.

asked by axel22, edited by curiousguy
  • "_From the C++0x proposal on C++ Atomic Types and Operations:_" The text you quoted is a really, really bad explanation. – curiousguy Dec 08 '19 at 21:40

2 Answers

8

The answer here is not trivial. Exactly what happens, and what is meant, depends on many things. For a basic understanding of cache coherence and memory ordering, my recent blog entries might be helpful.

But that aside, let me try to answer a few questions. First off, the function below is very hopeful about what is supported: very fine-grained control over exactly how strong a memory-ordering guarantee you get. That's reasonable for compile-time reordering, but often not for runtime barriers.

compare_swap( C& expected, C desired,
        memory_order success, memory_order failure )

Architectures won't all be able to implement this exactly as you requested; many will have to strengthen it to something they can implement. When you specify a memory_order, you are specifying how reordering may work. To use Intel's terms, you are specifying which type of fence you want; there are three of them: the full fence, the load fence, and the store fence. (But on x86, the load fence and store fence are only useful with weakly-ordered instructions like NT stores; atomics don't use them. Regular loads and stores give you everything except that stores can appear after later loads.) Just because you want a particular fence on that operation doesn't mean it is supported, in which case I'd hope it falls back to a full fence. (See Preshing's article on memory barriers.)

An x86 (including x64) compiler will likely use the LOCK CMPXCHG instruction to implement the CAS, regardless of memory ordering. This implies a full barrier; x86 doesn't have a way to make a read-modify-write operation atomic without a lock prefix, which is also a full barrier. Pure-store and pure-load can be atomic "on their own", with many ISAs needing barriers for anything above mo_relaxed, but x86 does acq_rel "for free" in asm.
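As a concrete illustration (my sketch, not the asker's code): gcc on x86-64 compiles a compare_exchange with acq_rel ordering to a lock cmpxchg with no separate fence instruction, exactly as it would for seq_cst.

#include <atomic>

bool cas_acq_rel(std::atomic<int>& a, int& expected, int desired) {
    return a.compare_exchange_strong(expected, desired,
                                     std::memory_order_acq_rel,
                                     std::memory_order_acquire);
}
// gcc -O2 for x86-64 emits essentially:
//     mov   eax, DWORD PTR [rsi]          ; load `expected`
//     lock cmpxchg DWORD PTR [rdi], edx   ; the lock prefix implies a full barrier
//     ... (set the bool result; write eax back to `expected` on failure)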

This instruction is lock-free, although all cores trying to CAS the same location will contend for access to it, so you could argue it's not really wait-free. (Algorithms that use it might not be lock-free, but the operation itself is wait-free; see Wikipedia's non-blocking algorithm article.) On non-x86 CPUs with LL/SC instead of locked instructions, C++11 compare_exchange_weak is normally wait-free, but compare_exchange_strong requires a retry loop in case of spurious failure.
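For example, the usual retry loop looks like this (my sketch, assuming a std::atomic<int>); compare_exchange_weak is the natural choice because the loop already tolerates spurious failure on LL/SC machines:

#include <atomic>

// Atomically multiply `a` by `factor`, returning the old value.
int fetch_multiply(std::atomic<int>& a, int factor) {
    int old = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(old, old * factor,
                                    std::memory_order_acq_rel,     // on success
                                    std::memory_order_relaxed)) {  // on failure
        // `old` has been reloaded with the current value; try again
    }
    return old;
}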

Now that C++11 has existed for years, you can look at the asm output for various architectures on the Godbolt compiler explorer.


In terms of memory synchronization you need to understand how cache coherence works (my blog may help a bit). New CPUs use a ccNUMA architecture (previously SMP). Essentially, the "view" of memory never gets out of sync. The fences used in the code don't actually force any flushing of the cache per se; they only force the store buffer to commit in-flight stores to cache before later loads.

If two cores both have the same memory location cached in a cache line, a store by one core will get exclusive ownership of the cache line (invalidating all other copies) and mark its own copy as dirty. A very simple explanation of a very complex process.

To answer your last question, you should always use the memory semantics that you logically need for correctness. Most architectures won't support all the combinations you use in your program. However, in many cases you'll get great optimizations, especially in cases where the order you requested is already guaranteed without a fence (which is quite common).

-- Answers to some comments:

You have to distinguish between what it means to execute a write instruction and to write to a memory location. This is what I attempt to explain in my blog post. By the time the "0" is committed to 0x100, all cores see that zero. Writing an aligned integer is also atomic: even without a lock, when you write to a location, all cores will immediately see that value if they wish to use it.

The trouble is that, to use the value, you have likely loaded it into a register first; any changes to the location after that obviously won't touch the register. This is why one needs mutexes or atomic<T> despite cache-coherent memory: the compiler is allowed to keep plain variable values in private registers. (In C++11, that's because a data race on non-atomic variables is undefined behaviour.)
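A minimal illustration of the register problem (my example, just a sketch): the compiler may hoist the load of a plain flag out of a loop, whereas an atomic load has to be redone every iteration.

#include <atomic>

bool plain_ready = false;
std::atomic<bool> atomic_ready(false);

void spin_on_plain() {
    // Legal for the compiler to load plain_ready into a register once and
    // spin forever: a data race on a non-atomic variable is undefined behaviour.
    while (!plain_ready) { }
}

void spin_on_atomic() {
    // Each iteration performs a real load, so another thread's store is seen.
    while (!atomic_ready.load(std::memory_order_acquire)) { }
}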

As to contradictory claims, generally you'll see all sorts of claims. Whether they are contradictory comes down to exactly what "see", "load", and "execute" mean in context. If you write "1" to 0x100, does that mean you executed the write instruction, or did the CPU actually commit that value? The difference created by the store buffer is one major cause of reordering (the only one x86 allows). The CPU can delay writing the "1", but you can be sure that the moment it does finally commit that "1", all cores see it. The fences control this ordering by making the thread wait until a store commits before doing later operations.
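The classic store-buffer litmus test makes that concrete (my sketch): each thread's store can sit in its own core's store buffer while the later load executes, so both threads can read the old value. This is exactly the StoreLoad reordering x86 permits.

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // may linger in this core's store buffer...
    r1 = y.load(std::memory_order_relaxed);  // ...while this later load executes
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}
// The outcome r1 == 0 && r2 == 0 is allowed, even on x86.
// Making these operations memory_order_seq_cst forbids it (the compiler then
// uses a full barrier such as mfence or an xchg store).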

answered by edA-qa mort-ora-y, edited by Peter Cordes
  • I should also add that an explicit fence instruction will not usually be used. The "lock" semantics, implicit locking of some functions, and ordering guarantees are usually enough. – edA-qa mort-ora-y Nov 18 '10 at 11:36
  • Thank you for your detailed answer. 1) My greatest concern with locks is thread preemption. Since `lock` in `lock cmpxchg` is not really a lock, but a semantics annot., `lock cmpxchg` executes "at once". 2) The next thing that troubles me is that `lock cmpxchg` actually flushes the buffer to memory - judging from what you've said and written on the blog, this doesn't happen on new CPUs. 3) Furthermore, it seems to me that there is less contention when 2 cores execute atomic operations on 2 separate, distant memory locations, since there is no reloading. Is this correct? Nice blog posts, btw. – axel22 Nov 18 '10 at 13:32
  • You wrote: "If two cores both have the same memory location cached in a cache-line, one will get marked dirty and the other will reload as necessary.", and a similar thing in your blog. On the other hand, in this question http://stackoverflow.com/questions/4213639/compare-and-swap-in-machine-code-in-c, a user claims: "But, if A does an ordinary write of "0" to address 0x100, then B writes "1" to 0x100, and then they both C&S on address 0x200 -- afterwards they will both see the same value at 0x200, but A might still think that 0x100 contains "0"." Aren't the two claims contradictory? – axel22 Nov 18 '10 at 14:18
  • The last comment, assuming of course, that by saying that cache-line reloading occurs, you meant it occurs with ordinary loads and stores, not those marked as atomic. – axel22 Nov 18 '10 at 14:20
  • Most of the basic operations don't flush to memory. Unless you explicitly tell the CPU to do so it won't generally flush to memory until it is a good time to do so -- and that won't likely interfere with your program. – edA-qa mort-ora-y Nov 18 '10 at 15:09
  • Thanks for these clarifications! – axel22 Nov 18 '10 at 16:28
  • "Flush" suggests that the cache is emptied and values written in RAM. Almost no program needs to do that. Except Spectre code! – curiousguy Dec 08 '19 at 21:05
  • I made some edits to clear up some mis-statements and link some of Jeff Preshing's articles about memory barriers. Most importantly that `mo_relaxed` still needs a `lock` prefix; it's not *just* for ordering, it's also necessary for atomicity of the whole read-modify-write! But also that this feels very x86 centric and didn't give a good explanation of barriers other than a full barrier. (Although I didn't really fix that.) But yes, cache is coherent; people often explain memory ordering with nonsense like cache holding conflicting values, rather than a store buffer delaying commit to L1d. – Peter Cordes Dec 10 '19 at 04:08
1

Your whole worldview seems off base: your question insinuates that cache consistency is controlled by memory orders at the C++ level and fences or atomic operations at the CPU level.

But cache consistency is one of the most important invariants of the physical architecture, and it's provided at all times by the memory system, which consists of the interconnection of all CPUs and the RAM. You can never beat it from code running on a CPU, or even see the details of its operation. Of course, by observing RAM directly and running code elsewhere you might see stale data at some level of memory: by definition the RAM doesn't have the newest value of all memory locations.

But code running on a CPU can't access DRAM directly, only through the memory hierarchy, which includes caches that communicate with each other to maintain coherency of this shared view of memory (typically with MESI). Even on a single core, a write-back cache lets DRAM values go stale, which can be an issue for non-cache-coherent DMA but not for reading or writing memory from a CPU.

So the issue exists only for external devices, and only ones that do non-coherent DMA. (DMA is cache-coherent on modern x86 CPUs; the memory controller being built-in to the CPU makes this possible).

Will such an operation run a full cache coherence protocol, synchronizing caches of different processor cores as if it were a memory fence on i7?

They are already synchronized. See Does a memory barrier ensure that the cache coherence has been completed? - memory barriers only do local things inside the core running the barrier, like flush the store buffer.
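For instance (a sketch added for illustration, assuming gcc on x86-64): a sequentially-consistent fence compiles to a single local instruction; it drains this core's store buffer but sends no command to other cores.

#include <atomic>

void publish(std::atomic<int>& flag, int* data) {
    *data = 42;                                           // plain store
    std::atomic_thread_fence(std::memory_order_seq_cst);  // gcc on x86-64: typically mfence
    flag.store(1, std::memory_order_relaxed);
}
// The fence only orders this core's own accesses; coherence between the
// caches of different cores is maintained by the hardware regardless.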

Or will it just synchronize the memory locations needed by this operation?

An atomic operation applies to exactly one memory location. What other locations do you have in mind?

On a weakly-ordered CPU, a memory_order_relaxed atomic increment could avoid making earlier loads/stores visible before that increment. But x86's strongly-ordered memory model doesn't allow that.
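For example (a sketch for illustration): a statistics counter that needs atomicity but no ordering with the surrounding data.

#include <atomic>

std::atomic<long> hits(0);

void count_hit() {
    // Atomic read-modify-write, but imposes no ordering on surrounding accesses.
    // On AArch64 with LSE this can be a single ldadd with no barrier; on x86 it
    // is still `lock add`, which happens to be a full barrier anyway.
    hits.fetch_add(1, std::memory_order_relaxed);
}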

answered by curiousguy, edited by Peter Cordes