What is the opposite of a "full memory barrier"?

Question

I sometimes see the term "full memory barrier" used in tutorials about memory ordering, which I think means the following:

If we have the following instructions:

instruction 1
full_memory_barrier
instruction 2

Then instruction 1 is not allowed to be reordered to below full_memory_barrier, and instruction 2 is not allowed to be reordered to above full_memory_barrier.

But what is the opposite of a full memory barrier? I mean, is there something like a "semi memory barrier" that only prevents the CPU from reordering instructions in one direction?

If there is such a memory barrier, I don't see its point. I mean, if we have the following instructions:

instruction 1
memory_barrier_below_to_above
instruction 2

Assume that memory_barrier_below_to_above is a memory barrier that prevents instruction 2 from being reordered to above memory_barrier_below_to_above, so the following will not be allowed:

instruction 2
instruction 1
memory_barrier_below_to_above

But the following will be allowed (which makes this type of memory barrier pointless):

memory_barrier_below_to_above
instruction 2
instruction 1

If you search for `acquire/release fence`, you should find a lot of information on one-way barriers that *are* useful. In particular, `acquire`ing a lock doesn't require a full barrier, because it's ok for previous operations to also be protected by the lock accidentally. Similarly, `release`ing a lock doesn't require a full barrier, because it's ok for subsequent operations to slip into the locked region. There are even *more* relaxed memory models in use, for example `RCU` in the Linux kernel. You can find further info there. — EOF, Jun 24 '18 at 21:12
@EOF acquire and release fences have a purpose because they are associated with another operation (for example: an acquire fence can be associated with a read operation, and a release fence can be associated with a write operation). But is there a one-way barrier that is not associated with any operation? — user8426277, Jun 24 '18 at 21:21
An `acquire-fence` or `acquire-barrier` basically turns the previous load(s) into `load-acquire`, a `release-fence` basically turns the previous store(s) into `store-release` (it's not *quite* that simple, but for a first approximation it's ok). — EOF, Jun 24 '18 at 21:25
Actually `acquire` and `release` are usually associated with specific operations and are not usually "fences" so to speak. I.e., you have a release-store or an acquire-load. The C++ memory model does have standalone acquire/release fences, but these are perhaps just confusingly named and actual use is dominated by acquire/release tied to specific operations. Hardware fences pretty much never use acquire/release terminology. @EOF — BeeOnRope, Jun 25 '18 at 02:33
The short answer is a "full barrier" generally means something like `mfence` on x86 or `dmb` on ARM which blocks reordering of all memory operations in both directions. So something that is not a full barrier has a weaker effect, e.g., allowing some types of reordering. — BeeOnRope, Jun 25 '18 at 02:38

Peter Cordes · Accepted Answer · 2018-06-26T02:10:42.453

http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ explains different kinds of barriers, like LoadLoad or StoreStore. A StoreStore barrier only prevents stores from reordering across the barrier, but loads can still execute out of order.

On real CPUs, any barriers that include StoreLoad block everything else, too, and thus are called "full barriers". StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache.
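To make the StoreLoad case concrete, here is a minimal C++11 sketch of the classic store-then-load litmus test. The function name `count_both_zero` and the variable names are hypothetical, chosen only for illustration. It assumes `std::atomic_thread_fence(std::memory_order_seq_cst)` plays the role of the full barrier; compilers typically implement it with a full-barrier instruction on mainstream hardware, but the exact instruction is an implementation detail.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus test: each thread stores to its own flag, then loads
// the other thread's flag. Returns how many runs ended with both loads
// reading 0, which requires a store to reorder with a later load.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            // Full barrier (includes StoreLoad): the store to x is ordered
            // before the load of y.
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With the seq_cst fences in place, the C++ memory model forbids the both-zero outcome, so the function should return 0. Weakening both fences to `memory_order_release` or `memory_order_acquire` would no longer rule that outcome out, because neither includes StoreLoad ordering.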

Barrier examples:

           strong               weak
x86        mfence               none needed unless you're using NT stores
ARM        dmb sy               isb,  dmb st, dmb ish, etc.
POWER      hwsync               lwsync, isync, ...

ARM has "inner" and "outer shareable domains". I don't really know what that means, haven't had to deal with it, but this page documents the different forms of Data Memory Barrier available. dmb st only waits for earlier stores to complete, so I think it's only a StoreStore barrier, and thus too weak for a C++11 release-store which also needs to order earlier loads against LoadStore reordering. See also C/C++11 mappings to processors: note that seq-cst can be achieved with full-barriers around every store, or with barriers before loads as well as before stores. Making loads cheap is usually best, though.

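ARM ISB flushes the pipeline so that later instructions are re-fetched; it doesn't clean or invalidate caches by itself. (ARM doesn't have coherent i-cache, so after writing code to memory, you need cache maintenance followed by an ISB before you can reliably jump there and execute those bytes as instructions.)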

POWER has a large selection of barriers available, including Light-Weight (non-full barrier) and Heavy-Weight Sync (full barrier) mentioned in Jeff Preshing's article linked above.

A one-directional barrier is what you get from a release-store or an acquire-load. A release-store at the end of a critical section (e.g. to unlock a spinlock) has to make sure loads/stores inside the critical section don't appear later, but it doesn't have to delay later loads until after the lock=0 becomes globally visible.

Jeff Preshing has an article about this, too: Acquire and Release semantics
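The spinlock case described above can be sketched in C++11 roughly as follows. `SpinLock` and `run_counter_demo` are hypothetical names for illustration, and this is a bare-bones lock (no backoff, no fairness), not a recommended implementation.

```cpp
#include <atomic>
#include <thread>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire RMW: loads/stores in the critical section can't move above
        // it, but earlier operations are allowed to sink into the section.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the current owner's release-store becomes visible
        }
    }
    void unlock() {
        // Release-store: loads/stores in the critical section can't move
        // below it, but later operations are allowed to hoist above it.
        locked.store(false, std::memory_order_release);
    }
};

// Two threads increment a plain (non-atomic) counter under the lock.
long run_counter_demo(int per_thread) {
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

Calling `run_counter_demo(50000)` should return 100000: the one-way acquire and release orderings are enough to keep the increments inside the critical section, with no full barrier required by the source code.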

The "full" vs. "partial" barrier terminology is not usually used for the one-way reordering restriction of a release-store or acquire-load. An actual release fence (in C++11, std::atomic_thread_fence(std::memory_order_release)) does block reordering of stores in both directions, unlike a release-store on a specific object.

This subtle distinction has caused confusion in the past (even among experts!). Jeff Preshing has yet another excellent article explaining it: Acquire and Release Fences Don't Work the Way You'd Expect.

You're right that a one-way barrier that wasn't tied to a store or a load wouldn't be very useful; that's why such a thing doesn't exist. :P It could reorder an unbounded distance in one direction and leave all the operations to reorder with each other.

What exactly does atomic_thread_fence(memory_order_release) do?

C11 (n1570 Section 7.17.4 Fences) only defines it in terms of creating a synchronizes-with relationship with an acquire-load or acquire fence, when the release-fence is used before an atomic store (relaxed or otherwise) to the same object the load accesses. (C++11 has basically the same definition, but discussion with @EOF in comments brought up the C11 version.)

This definition in terms of the net effect, not the mechanism for achieving it, doesn't directly tell us what it does or doesn't allow. For example, subsection 3 says

3) A release fence A synchronizes with an atomic operation B that performs an acquire operation on an atomic object M if there exists an atomic operation X such that A is sequenced before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation

So in the writing thread, it's talking about code like this:

stuff           // including any non-atomic loads/stores

atomic_thread_fence(mo_release)  // A
M=X                              // X
  // threads that see load(M, acquire) == X also see stuff

The syncs-with means that threads which see the value from M=X (directly or indirectly through a release-sequence) also see all the stuff and read non-atomic variables without Data Race UB.
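As a compilable version of that pattern, here is a minimal C++11 sketch. The names `run_fence_handoff`, `payload`, and `seen` are hypothetical; `M` and the labels A, X, and B follow the quoted standard text. The reader uses an acquire-load, which is one of the two options the standard describes.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo: hand a non-atomic payload to another thread using a
// release fence followed by a relaxed store.
int run_fence_handoff() {
    int payload = 0;          // "stuff": plain non-atomic data
    std::atomic<int> M{0};    // the atomic object M
    int seen = -1;

    std::thread writer([&] {
        payload = 42;                                         // stuff
        std::atomic_thread_fence(std::memory_order_release);  // A
        M.store(1, std::memory_order_relaxed);                // X
    });
    std::thread reader([&] {
        while (M.load(std::memory_order_acquire) != 1) {      // B
            // spin until the value written by X is visible
        }
        seen = payload;  // no data race: A synchronizes with B
    });
    writer.join();
    reader.join();
    return seen;
}
```

Because the reader's acquire-load reads the value written by X, the fence A synchronizes with it, and `run_fence_handoff()` should return 42.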

This lets us say something about what is / isn't allowed:

It's a 2-way barrier for atomic stores. They can't cross it in either direction, so the barrier's location in this thread's memory order is bounded by atomic stores before and after. Any earlier store can be part of stuff for some M, any later store can be the M that an acquire-load (or load + acquire-fence) synchronizes with.

It's a one-way barrier for atomic loads: earlier ones need to stay before the barrier, but later ones can move above the barrier. M=X can only be a store (or the store part of a RMW).

It's a one-way barrier for non-atomic loads/stores: non-atomic stores can be part of the stuff, but can't be X because they're not atomic. It's ok to allow later loads / stores in this thread to appear to other threads before the M=X. (If a non-atomic variable is modified before and after the barrier, then nothing could safely read it even after a syncs-with this barrier, unless there's also a way for a reader to stop this thread from continuing on and creating Data Race UB. So a compiler can and should reorder foo=1; fence(release); foo=2; into foo=2; fence(release);, eliminating the dead foo=1 store. But sinking foo=1 to after the barrier is only legal on the technicality that nothing could tell the difference without UB.)
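The "any later store can be the M" point can be illustrated with a sketch in which one release fence covers two later relaxed stores, and each reader pairs a relaxed load with an acquire fence. All names here (`run_two_flags_demo`, `flag_a`, `flag_b`, `data`) are hypothetical and exist only for this example.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo: a single release fence orders `data` before BOTH later
// relaxed stores, so a reader that observes either flag may safely read data.
int run_two_flags_demo() {
    int data = 0;                                  // non-atomic "stuff"
    std::atomic<bool> flag_a{false}, flag_b{false};
    int seen_a = -1, seen_b = -1;

    std::thread writer([&] {
        data = 7;
        std::atomic_thread_fence(std::memory_order_release);
        // Neither store may move above the fence; either can act as "M".
        flag_a.store(true, std::memory_order_relaxed);
        flag_b.store(true, std::memory_order_relaxed);
    });
    auto read_via = [&](std::atomic<bool>& flag, int& out) {
        while (!flag.load(std::memory_order_relaxed)) {
            // spin on a relaxed load
        }
        // load + acquire fence: synchronizes with the writer's release fence
        std::atomic_thread_fence(std::memory_order_acquire);
        out = data;
    };
    std::thread reader_a([&] { read_via(flag_a, seen_a); });
    std::thread reader_b([&] { read_via(flag_b, seen_b); });

    writer.join();
    reader_a.join();
    reader_b.join();
    return seen_a + seen_b;
}
```

Both readers should observe `data == 7`, so the function should return 14. A release-store on `flag_a` alone would not give the reader of `flag_b` the same guarantee, which is the difference between a release fence and a release operation on one object.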

As an implementation detail, a C11 release fence may be stronger than this (e.g. a 2-way barrier for more kinds of compile-time reordering), but not weaker. On some architectures (like ARM), the only option that's strong enough might be a full barrier asm instruction. And for compile-time reordering restrictions, a compiler might not allow these 1-way reorderings just to keep the implementation simple.

Mostly this combined 2-way / 1-way nature only matters for compile-time reordering. CPUs don't make the distinction between atomic vs. non-atomic stores. Non-atomic is always the same asm instruction as relaxed atomic (for objects that fit in a single register).

CPU barrier instructions that make a core wait until earlier operations are globally visible are typically 2-way barriers; they're specified in terms of operations becoming globally visible in a coherent view of memory shared by all cores, rather than the C/C++11 style of creating syncs-with relations. (Beware that operations can potentially become visible to some other threads before they become globally visible to all threads: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? But with just barriers against reordering within a physical core, sequential consistency can be recovered.)

A C++11 release-fence needs LoadStore + StoreStore barriers, but not LoadLoad. A CPU that lets you get just those 2 but not all 3 of the "cheap" barriers would let loads reorder in one direction across the barrier instruction while blocking stores in both directions.

Weakly-ordered SPARC is in fact like this, and uses the LoadStore and so on terminology (that's where Jeff Preshing took the terminology for his articles). http://blog.forecode.com/2010/01/29/barriers-to-understanding-memory-barriers/ shows how they're used. (More recent SPARCs use a TSO (Total Store Order) memory model. I think this is like x86, where the hardware gives the illusion of memory ops happening in program order except for StoreLoad reordering.)

Your last sentence is questionable: the point is that the fence is *not* "tied to" any particular load or store, but rather to the sets of preceding stores or subsequent loads. — EOF, Jun 24 '18 at 21:54
@EOF: If such a fence was one-way and wasn't tied to a load or store, the fence itself could reorder an unbounded distance in one direction and leave relaxed operations together to reorder with each other. Go read http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/, it talks about this for release fences specifically. A standalone release *fence* is not a "release operation"; it's a 2-way fence that separates later from earlier, which is necessary to make all the later relaxed stores release-stores wrt. all stuff before the barrier. — Peter Cordes, Jun 24 '18 at 22:54
Alternatively, *you* could read what the C11draft standard has to say about it. C11 draft standard n1570: *7.17.4 Fences* — EOF, Jun 25 '18 at 18:20
@EOF: [n1570 7.17.4](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf) point 2 talks about a `fence(release); m = x;` as an alternative to `m.store(x, release)` for providing a synchronizes-with relationship with an acquire-load or a relaxed load with an acquire fence. If the fence could reorder with earlier stores, they could also reorder with `m=x` and lose the synchronizes-with. Or if the fence could reorder with the `m=x` store, again the `m=x` could reorder with earlier stores, no longer providing a syncs-with. This is what Jeff Preshing's article explained for C++11. — Peter Cordes, Jun 25 '18 at 18:38
@EOF: `m.store(x,mo_release)` is different because reordering that operation in one direction takes the store and the barrier later. It doesn't separate earlier stores from later stores, only earlier stores (and loads) from *itself*. **But a stand-alone barrier has no single store that it keeps ordered while allowing others to cross.** That's why it can't do its job if it reorders in either direction. — Peter Cordes, Jun 25 '18 at 18:43
The point is that the fence is not associated with *any particular* memory operation, but *sets* of operations. After `T1: {set of stores 1} release {set of stores 2}` `T2: {set of loads 1} acquire {set of loads 2}`, if in `T2` *any* of the loads from set 1 sees a value written in set 2 `T1`, it is guaranteed to see all of the writes in set 1 of `T1` in reads in set 2 of `T2`. — EOF, Jun 25 '18 at 20:00
@EOF: right, and `atomic_thread_fence(mo_release)` has to be a 2-way barrier in the asm (and for compile-time reordering) to create that effect. The asm model of ordering the global visibility of operations is a different model from the C11/C++11 model of providing syncs-with relationships, and this question is asking about the asm ordering model that compilers have to use to produce the required effects on normal hardware. (Or if not a different model, then a different way of describing the same thing; I'm not sure. C11 is about the result without nailing down how you get it.) — Peter Cordes, Jun 25 '18 at 20:15
Even in C11 language, I'd still describe a release fence as a 2-way fence, because stores can't cross it in either direction. Loads can cross it in 1 direction, though. (So in SPARC asm terminology adopted by Jeff Preshing for his memory barriers article, it has to be a StoreStore + LoadStore barrier, but not StoreLoad or LoadLoad.) — Peter Cordes, Jun 25 '18 at 20:20
@EOF: I looked at this more closely, and there's definitely a 2-way component for `fence(release)`, but there's only a 1-way requirement for loads or for non-atomic operations. Updated my answer. I still say my original answer wasn't *wrong* (a purely 1-way barrier would be useless), but there is interesting stuff to be said. — Peter Cordes, Jun 26 '18 at 02:11
I didn't say it was wrong, I said it was questionable. Anyway, as usual I'm impressed with the depth and detail of your answer. On the ARM `dmb i/o/n/sh/sy` issue, you're not the only one who doesn't understand it. Even in the tiny field of people who deal with this stuff, that part of the ARM memory model is obscure. At some point ARM finally clarified that they expect an operating system environment (possibly virtual?) to be an inner shareable domain. So for everyone who isn't writing hypervisor or device code, `dmb ish` should be sufficient. — EOF, Jun 26 '18 at 16:02
@PeterCordes You mentioned, `StoreLoad is the most expensive kind because it means draining the store buffer before later loads can read from L1d cache`. Does it mean that when one thread issues a full memory barrier (`mfence`, for example), all cores will have to drain their store buffers before later loads can be executed? Or does such draining only happen on the core that executes the mfence? — HCSF, Oct 25 '19 at 06:07
@HCSF: memory barriers are local! Every (logical) core has its own store buffer. `mfence` works by stalling the core that executes it until its own store buffer drains, that is all. (Potentially allowing later non-memory uops to still execute, but [Skylake can't even do that](//stackoverflow.com/a/50496379 "Are loads and stores the only instructions that gets reordered?") because a microcode update to work around an erratum added an lfence-like barrier to it.) — Peter Cordes, Oct 25 '19 at 06:13
For the curious, although all x86 memory barrier _instructions_ (and all others I'm aware of) are local, there is a type of non-local memory barrier, as implemented by the [membarrier(2)](http://man7.org/linux/man-pages/man2/membarrier.2.html) call on Linux. This acts as a "remote" memory barrier and allows all the cost of synchronization to be moved onto one of the actors in some cases (where there is an inherent asymmetry in the sync protocol). — BeeOnRope, Oct 25 '19 at 06:31
@BeeOnRope: yup, good point. If you *do* want a global memory barrier, the implementation mechanism Linux `membarrier(2)` uses is (IIRC) to broadcast an Inter-Processor Interrupt (IPI) which indirectly results in serializing all other cores. So that's a useful contrast to how `mfence` works. It has a cost for all cores, but it's only paid at the time it's actually done, not every time potential readers read something. — Peter Cordes, Oct 25 '19 at 06:35
@PeterCordes - yeah I think there are a couple of different mechanisms although I haven't checked lately what's implemented. The IPI way is the fast one, but has a high cost while another option is to wait until each running thread/process experiences a context switch or interrupt. — BeeOnRope, Oct 25 '19 at 06:39
