
This question is a follow-up/clarification to this:

Does the MOV x86 instruction implement a C++11 memory_order_release atomic store?

That answer states that a plain MOV instruction is sufficient to get acquire-release semantics on x86. We do not need LOCK, fences, XCHG, etc. However, I am struggling to understand how this works.

Intel doc Vol 3A Chapter 8 states:

https://software.intel.com/sites/default/files/managed/7c/f1/253668-sdm-vol-3a.pdf

In a single-processor (core) system....

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes, with the following exceptions:

but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

In a multiple-processor system, the following ordering principles apply:

  • Individual processors use the same ordering principles as in a single-processor system.
  • Writes by a single processor are observed in the same order by all processors.
  • Writes from an individual processor are NOT ordered with respect to the writes from other processors.
  • Memory ordering obeys causality (memory ordering respects transitive visibility).
  • Any two stores are seen in a consistent order by processors other than those performing the stores
  • Locked instructions have a total order.

So how can MOV alone facilitate acquire-release?

user997112
  • Isn't `MOV` rather *sequentially consistent* by itself than putting `rel-acq` fences? Because it only gets reordered under very limited conditions. It reminds me of Herb Sutter's very insightful presentation of the SC-DRF memory model a while ago. – Dean Seo Feb 20 '20 at 07:04
  • @DeanSeo: no, x86's hardware memory model is SC + a store buffer with store forwarding. This is like acq_rel, not SC. – Peter Cordes Feb 20 '20 at 07:16
  • @PeterCordes Interesting! Thanks for the correction! – Dean Seo Feb 20 '20 at 07:25

2 Answers


but this is for a single core. The multi-core section does not seem to mention how loads are enforced:

The first bullet point in that section is key: Individual processors use the same ordering principles as in a single-processor system. The implicit part of that statement is ... when loading/storing from cache-coherent shared memory. i.e. multi-processor systems don't introduce new ways for reordering, they just mean the possible observers now include code on other cores instead of just DMA / IO devices.

The model for reordering of access to shared memory is the single-core model, i.e. program order + a store buffer = basically acq_rel. Actually slightly stronger than acq_rel, which is fine.

The only reordering that happens is local, within each CPU core. Once a store becomes globally visible, it becomes visible to all other cores at the same time, and didn't become visible to any core before that. (Except to the core doing the store, via store forwarding.) That's why only local barriers are sufficient to recover sequential consistency on top of an SC + store-buffer model. (For x86, mo_seq_cst just needs mfence after SC stores, to drain the store buffer before any later loads can execute. mfence and locked instructions (which are also full barriers) don't have to bother other cores; they just make this one wait.)

One key point to understand is that there is a coherent shared view of memory (through coherent caches) that all processors share. The very top of chapter 8 of Intel's SDM defines some of this background:

These multiprocessing mechanisms have the following characteristics:

  • To maintain system memory coherency — When two or more processors are attempting simultaneously to access the same address in system memory, some communication mechanism or memory access protocol must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock a memory location.
  • To maintain cache consistency — When one processor accesses data cached on another processor, it must not receive incorrect data. If it modifies data, all other processors that access that data must receive the modified data.
  • To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes be observed externally in precisely the same order as programmed.
  • [...]

The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 11.

(CPUs use some variant of MESI; Intel in practice uses MESIF, AMD in practice uses MOESI.)

The same chapter also includes some litmus tests that help illustrate / define the memory model. The parts you quoted aren't really a strictly formal definition of the memory model. But the section 8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations shows that loads aren't reordered with loads. Another section also shows that LoadStore reordering is forbidden. Acq_rel is basically blocking all reordering except StoreLoad, and that's what x86 does. (https://preshing.com/20120913/acquire-and-release-semantics/ and https://preshing.com/20120930/weak-vs-strong-memory-models/)


Other ISAs

In general, most weaker memory HW models also only allow local reordering so barriers are still only local within a CPU core, just making (some part of) that core wait until some condition. (e.g. x86 mfence blocks later loads and stores from executing until the store buffer drains. Other ISAs also benefit from light-weight barriers for efficiency for stuff that x86 enforces between every memory operation, e.g. blocking LoadLoad and LoadStore reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/)

A few ISAs (only PowerPC these days) allow stores to become visible to some other cores before becoming visible to all, allowing IRIW reordering. Note that mo_acq_rel in C++ allows IRIW reordering; only seq_cst forbids it. Most HW memory models are slightly stronger than ISO C++ and make it impossible, so all cores agree on the global order of stores.

Peter Cordes
  • Hm, neat timing on the answers. This looks a bit better than mine. :) – GManNickG Feb 20 '20 at 08:30
  • @GManNickG: thanks. I like that yours works through the implications of the individual ordering guarantees. It took me a while to come up with coherent shared memory as the piece of the puzzle the OP might be missing, and notice that the Intel manual doesn't really clearly make that point. It's easy to take for granted, until you run into people with misconceptions like stale values existing in cache. (Because of clumsy descriptions and misunderstandings of compilers "caching" copies of shared vars in *registers* (private, not coherent).) – Peter Cordes Feb 20 '20 at 08:58
  • @PeterCordes thanks Peter. May I ask you to elaborate upon your second paragraph, regarding the local core, store buffer and mfence? I don't understand that part because we confirmed no fences are required for acquire-release but then you mention an mfence. I'd like to understand this. – user997112 Feb 20 '20 at 16:39
  • @user997112: I mention `mfence` in the context of what's needed for sequential consistency (SC aka seq_cst) on x86. I mentioned it to point out that everything mfence does is local, within the core that executes it. Thanks for pointing out the possible confusion in how I explained that, I see it now; updated. – Peter Cordes Feb 20 '20 at 22:01
  • @PeterCordes Awesome. So is this a fair summary: on x86 when sharing 64-bits or less, acquire-release is sufficient. However, when sharing memory greater than 64-bits (or multiple non-contiguous regions of memory) stricter memory barriers are required? – user997112 Feb 20 '20 at 23:23
  • @user997112: Huh? No. acq-rel is about ordering of other loads/stores relative to this one. e.g. write a big buffer, then `data_ready.store(true, mo_release);`. A reader that does `data_ready.load(mo_acquire)` and sees `true` can then safely read the buffer, even if the buffer is non-atomic. If you only have one 64-bit shared variable, you don't need any ordering of anything else, just mo_relaxed for that one lock-free variable. – Peter Cordes Feb 21 '20 at 00:38
  • @PeterCordes Okay, understood. Under what circumstances would I need to use any of the other memory barriers on x86? This is what is confusing me. – user997112 Feb 21 '20 at 02:03
  • @user997112: other than mfence? The use-cases for SFENCE are only if you've used weakly-ordered NT stores and want to "release" them with a "data-ready=true". The use-cases for LFENCE are basically non-existent. Intel might have had plans to introduce weakly-ordered loads but never did so (except SSE4.1 movntdqa from WC memory, like video RAM). [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](//stackoverflow.com/a/50780314). Of course normally you don't manually use barriers yourself, you let the compiler emit them for you for source that uses `std::atomic<>`. – Peter Cordes Feb 21 '20 at 02:42
  • @PeterCordes Sorry, i should have been more specific: Under what circumstances would I need to use any of the other memory orders (consume, acq_rel, seq_cst) on x86? – user997112 Feb 21 '20 at 04:57
  • @user997112: to get more performance than seq_cst when you don't need as much ordering. `mov` + `mfence` (or `xchg`) is pretty slow. Acquire and release are free at runtime, but relaxed can allow compile-time optimization of other operations around the atomic. (Atomic RMW operations on x86 are always a full barrier; seq_cst pure stores are the expensive thing.) In general, for maximum performance use as weak an order as strictly necessary. In general, for maximum safety against design mistakes, just use the default seq_cst, especially if you can't actually test your code on a weak ISA. – Peter Cordes Feb 21 '20 at 05:19
  • @PeterCordes I've probably asked my question badly: is it possible to give an example where seq_cst is needed on x86, where acquire and release wouldn't be enough? Second question: dare you implying acquire and release prevent compiler re-ordering? – user997112 Feb 21 '20 at 18:58
  • @user997112: oh. https://preshing.com/20120515/memory-reordering-caught-in-the-act/. You need seq_cst when you store and then want to load and see what other threads might see / have seen. And yes, compile-time reordering has to respect the ISO C++ memory model (not the HW memory model for cases where they differ, e.g. a relaxed store can be reordered at compile time, or an acquire load can reorder in one direction only at compile time, relative to relaxed and non-atomic operations. Even when compiling for x86, where in asm everything is an acquire load.) – Peter Cordes Feb 22 '20 at 03:42
  • @user997112: C/C++ are high-level languages. You can only reason in terms of the "official" semantics, that is, the most common-sense interpretation of the intent of the rules. Compilers that compile naively today will optimize more aggressively one day. Case in point: signed arithmetic overflow does not wrap around; it is undefined. Compilers will simplify arithmetic expressions using math rules, which imply that a sum of positive numbers is positive, even when compiling for common CPUs that guarantee two's-complement wraparound on overflow! – curiousguy Mar 07 '23 at 02:03
  • @curiousguy: The C++ formalism can be pretty abstract sometimes, especially the memory model. Unlike signed integer overflow (which is UB instead of implementation-defined for fairly arbitrary reasons), you can construct a litmus test with a difference between `acq_rel` vs. `seq_cst` without UB. They do map to a cache coherency + barriers model of what happens on real hardware. It's very reasonable to ask what hardware concept and feature (like `mfence`) is being modeled by different C++ things, as a way to start to get a handle on them. – Peter Cordes Mar 07 '23 at 02:08
  • @curiousguy: You're correct that this is not a way to prove correctness of something you want to do in C++, though. Although sometimes you can more easily prove that something *isn't* guaranteed in C++, e.g. by looking at how it would compile for AArch64 and seeing that it can fail there. (AArch64 is a good example of a HW memory model that's not much stronger than ISO C++ in a lot of cases, e.g. it can let seq_cst stores reorder with later operations that aren't also seq_cst). – Peter Cordes Mar 07 '23 at 02:13

Refreshing the semantics of acquire and release (quoting cppreference rather than the standard, because it's what I have on hand - the standard is more...verbose, here):

memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load. All writes in other threads that release the same atomic variable are visible in the current thread

memory_order_release: A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable

This gives us four things to guarantee:

  • acquire ordering: "no reads or writes in the current thread can be reordered before this load"
  • release ordering: "no reads or writes in the current thread can be reordered after this store"
  • acquire-release synchronization:
    • "all writes in other threads that release the same atomic variable are visible in the current thread"
    • "all writes in the current thread are visible in other threads that acquire the same atomic variable"

Reviewing the guarantees:

  • Reads are not reordered with other reads.
  • Writes are not reordered with older reads.
  • Writes to memory are not reordered with other writes [..]
  • Individual processors use the same ordering principles as in a single-processor system.

This is sufficient to satisfy the ordering guarantees.

For acquire ordering, consider that a read of the atomic has occurred: for that thread, clearly any later read or write migrating before it would violate the first or second bullet point, respectively.

For release ordering, consider that a write of the atomic has occurred: for that thread, clearly any prior read or write migrating after it would violate the second or third bullet point, respectively.

The only thing left is to ensure that if a thread reads a released store, it will see all the other writes the writer thread had produced up to that point. This is where the other multi-processor guarantee is needed.


  • Writes by a single processor are observed in the same order by all processors.

This is sufficient to satisfy acquire-release synchronization.

We've already established that when the release write occurs, all other writes prior to it will have also occurred. This bullet point then ensures that if another thread reads the released write, it will read all the writes the writer produced up to that point. (If it does not, then it would be observing that single processor's writes in a different order than the single processor, violating the bullet point.)

GManNickG