
I was reading about memory orders in C++. I could understand the relaxed and acquire-release models fairly well, but I'm struggling with sequential consistency.

If I am not wrong, according to cppreference, a std::memory_order_seq_cst 'operation' is equivalent to:

  • an acquire operation plus a single total order when the operation is 'load'.
  • a release operation plus a single total order when the operation is 'store'.
  • an acq-rel operation plus a single total order when the operation is 'read-modify-write'.

But what is the case with a std::memory_order_seq_cst 'fence'? Which of these is it equivalent to?

  • an acquire fence plus a single total order.
  • a release fence plus a single total order.
  • an acq-rel fence plus a single total order.

If it is equivalent to one of the above, what about the other two?

As far as I know, if it is case 1 (acquire fence), compilers would be free to move any write operation from above the fence to below it. Similarly, if it is case 2 (release fence), compilers would be free to move any read operation from below the fence to above it. Finally, if it is case 3 (acq-rel fence), compilers would be disallowed from moving any instructions across the fence. Is this correct?

I'm still quite confused, and the statements above may be incorrect. Please correct me where I'm wrong.

Sourav Kannantha B
  • Note that I'm talking about atomic fences and not about atomic operations. – Sourav Kannantha B Feb 27 '23 at 18:35
  • The standard calls it "a sequentially consistent acquire and release fence" (§7.17.4.1) – Homer512 Feb 27 '23 at 20:54
  • @Homer512 then there is no 'sequentially consistent acquire fence' and no 'sequentially consistent release fence'? Or do these two have no use cases? – Sourav Kannantha B Feb 28 '23 at 01:45
  • `std::atomic_thread_fence( std::memory_order )` has to use one of the standard memory_order values; there aren't different flavours of `seq_cst`, there's only `std::memory_order_seq_cst`. I don't think there'd be much use for a fence that was for example part of the SC total order but which allowed non-SC loads to reorder with it, only ordering stores. Also not cheaper on many machines; an SC fence has to include a StoreLoad barrier, the most expensive type, might as well do them all: https://preshing.com/20120913/acquire-and-release-semantics/ – Peter Cordes Feb 28 '23 at 08:40
  • @PeterCordes Is a `seq_cst` fence a complete full fence, meaning no operations can cross it? Or can loads below it cross above it, and stores above it cross below it? – Sourav Kannantha B Feb 28 '23 at 14:44
  • @SouravKannanthaB Nothing in the spec (or the std committee attempt at making a useful spec) ever mentions "crossing" or "allow reordering". A fence is an imaginary line in semantics; crossing it is immaterial. But you mean: can a memory op A, that is before a fence X in program order, run after a memory op B after that X? The spec isn't about the orders of executions of memory ops, it's about visibility of memory ops and "happens-before" (or "HB"), so "runs after" is immaterial. You need some "inter-thread observable behavior" to ask any question re: threads (not real observables). – curiousguy Feb 28 '23 at 16:32
  • (I admit I'm still working on understanding seq_cst.) I can't make sense of your Q; like, at all. What do total order and release fences have to do with one another? Could you try to write some code showing what I call "inter-thread obs" (observables) = things one thread observes about memory ops of another – not real complete program observables like I/O? The way the semantics of linear (non-MT) programs is defined is by external obs, not memory actions; the way MT program parts are defined is by inter-thread obs (then complete obs). – curiousguy Feb 28 '23 at 16:41
  • As @curiousguy says, the C++ formalism is defined in terms of syncs-with and happens-before. The way CPUs and compilers implement those semantics is coherent shared cache, with local reordering allowed for accesses to it. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ (And on e.g. PowerPC, [store-forwarding between logical cores sharing a physical core before commit to L1d cache, creating IRIW reordering](https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see).) – Peter Cordes Feb 28 '23 at 20:38
  • To implement fence(seq_cst) in the normal mapping from C++ to asm (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), you need to block all reordering, it's a full barrier. Unlike any other fence strength, it needs to keep an earlier relaxed store from reordering with a later relaxed load. (i.e. block StoreLoad reordering, like I said.) It also does all the things, LoadStore, LoadLoad and StoreStore. – Peter Cordes Feb 28 '23 at 20:42
  • LL / LS / SS without StoreLoad would just be `fence(acq_rel)`, which x86 compilers can do for free by just blocking compile-time reordering. But note you still can't just let earlier stores cross it and reorder with later stores, even acq_rel and release fences have to be a two-way barrier for stores. See https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ . (a followup to https://preshing.com/20130922/acquire-and-release-fences/) – Peter Cordes Feb 28 '23 at 20:47
  • The no-StoreLoad reordering requirement comes from the fact that the "single total order" of all SC operations *has to be consistent with program order* (sequenced-before). I forget the formal details of how SC fences interact with that. But the key thing you left out of your question is that the total order has to be some interleaving across threads of program order. – Peter Cordes Feb 28 '23 at 20:51
  • @curiousguy As much as I try to come up with a C++ example, I couldn't. Perhaps there is a fault in the question itself. My idea was that there are three memory fences – `acq`, `rel` and `acq-rel`. By adding the constraint of a single total order to each of these, you would get three more. But I don't have any example to show this. – Sourav Kannantha B Feb 28 '23 at 21:33
  • @PeterCordes Those articles were great. But I always keep forgetting this concept or get confused. And as you said, I totally forgot about program order being a thing. :) – Sourav Kannantha B Feb 28 '23 at 21:35
  • @SouravKannanthaB You seem positively confused. There is no such thing as an "order" of acquire/release; it wouldn't make sense. In a "100% acq/rel program" (where each read is acq, each write is rel, hence each RMW is acq+rel), *adding acq or rel or acq+rel fences does nothing*. Rel = publication of accomplishment (completed ops), acq = observation of a rel. Accomplishment is finishing a memory access (write AND read); it's often used transitively in a chain. Seq_cst is something else; it concerns distinct vars with memory ops that don't observe each other, like two writes to diff atomics. – curiousguy Mar 04 '23 at 03:41
  • A,B,C,D being memory ops; X,Y being fences. "_Is seq_cst fence a complete-full fence? Means no operations can cross it?_" IOW: can "A X B" sometimes act like "B X A". These can only be distinguished by "observations" from another thread (say doing "C Y D"), so the Q involves multiple fences. [Unlike a program usually doing "hello, world" but "world, hello" at times (`cout` is an observable, from outside).] I will now call "inter-thread observable" just "observables". – curiousguy Mar 04 '23 at 05:54
  • ... While most writes are obviously observables (there is a before and an after), they can write back the same value (f.ex. `a+=0;`), and reads can't be observed directly. But after the fact, a read can be seen as not being seq-cst with any history of memory actions. When ppl say "ops can be reordered", they mean "for any observations by other threads, the reordering isn't observable". – curiousguy Mar 04 '23 at 06:08
  • ... The relevant parts of cppreference that say a seq_cst fence can be used to observe the effect of another are "_For a pair of atomic operations on M called A and B, where A writes and B reads M's value, if there are two memory_order_seq_cst std::atomic_thread_fences X and Y, and if A is sequenced-before X, Y is sequenced-before B, and X appears before Y in the Single Total Order, then B observes either_" and "_or, there are memory_order_seq_cst std::atomic_thread_fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X appears before Y in the Single Total Order._" – curiousguy Mar 04 '23 at 06:13
  • I often try to generalize rules. Sometimes these go wrong and I get confused, as in this case. Sure, I will try my best to understand things clearly. Thanks for these explanations. I will try to follow. – Sourav Kannantha B Mar 04 '23 at 19:14
  • @curiousguy also, do you think I should close/delete this question, as it may be potentially misleading? – Sourav Kannantha B Mar 04 '23 at 19:23
  • @SouravKannanthaB I don't think you need to delete the Q; everyone is "struggling" with at least some stuff in thread programming, and what seq_cst means in addition to acq/rel is a plausible Q (others might have the same Q). – curiousguy Mar 05 '23 at 04:37

2 Answers


In short, it's case 3.

A seq_cst fence includes all the functionality of an acq_rel fence. See [atomics.fences p5.5]:

is a sequentially consistent acquire and release fence, if order == memory_order::seq_cst.

So in particular it is an acquire and release fence. And it is also a sequentially consistent fence, meaning that it is included in the total order S defined in [atomics.order p4], in a manner consistent with the rules stated there.

The formal C++ memory model does not have a concept of "reordering". However, you are right that a typical implementation of a seq_cst fence would be one which prevents loads and stores from being reordered across it in either direction. Note that this is strictly stronger than an acq_rel fence, which would permit a store before the fence to be reordered with a load after the fence (StoreLoad reordering).

Nate Eldredge

This is what I think the authors of the standard intended for C17 and C++20: my version of the abstract machine (see below). A proof of the logical equivalence of the old and the new abstract machines is not included, though, sorry.

Regarding your question, please read rule 13 below. You will notice the (very simple) difference; it is obvious when the rules are simplified like that. The difference comes from this idiomatic example, which shows that an acq_rel combo-fence is not enough to avoid x == 0 && y == 0, so a seq_cst fence with special properties (namely a full memory barrier; forget about the "single order") is needed:

// Shared state (the original declared these locally with extern
// cross-references; hoisted to globals so the example is well-formed)
std::atomic<int> a {0}, b {0}, x {0}, y {0};

// Without reordering
// thread A
void thread_A() {
    a.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acq_rel);
    if (b.load(std::memory_order_relaxed) == 1) {
        x.store(1, std::memory_order_relaxed);
    }
    /* ... */
}
// thread B
void thread_B() {
    b.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acq_rel);
    if (a.load(std::memory_order_relaxed) == 1) {
        y.store(1, std::memory_order_relaxed);
    }
    /* ... */
}

// With the reordering the acq_rel fence permits: each load hoisted
// above the store, the fence split into an acquire and a release part
// thread A
void thread_A_reordered() {
    std::atomic_thread_fence(std::memory_order_acquire);
    int c = b.load(std::memory_order_relaxed);
    a.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    if (c == 1) {
        x.store(1, std::memory_order_relaxed);
    }
    /* ... */
}
// thread B
void thread_B_reordered() {
    std::atomic_thread_fence(std::memory_order_acquire);
    int d = a.load(std::memory_order_relaxed);
    b.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    if (d == 1) {
        y.store(1, std::memory_order_relaxed);
    }
    /* ... */
}

Mathematicians are welcome. Jokers too. Due respect to Jeff Preshing and his series of articles about lock-free programming!

Multi-threaded executions and data races (new version of n2346, C17)

  1. Under a hosted implementation, a program can have more than one thread of execution (or thread) running concurrently. The execution of each thread proceeds as defined by the remainder of this document. The execution of the entire program consists of an execution of all of its threads. Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.

  2. If there are two accesses to the same memory location, at least one of which is a writing, and the two are not ordered by the "happens before" relation defined below, the behavior is undefined.

  3. Note: an access is an action that happens in the execution environment, see rules n2346 (3.1, 5.1).

  4. For any possible execution of the entire program, for any atomic object, any pair A, B of accesses to it, "A happens before B" xor "B happens before A". If there is also an access C to the atomic object, such that "A happens before C" and "C happens before B", then "A happens before B".

  5. Note: the proof of the rule consists of: a) proving the absence of any consequences to the introduction of the converse rules of the cache coherence rules of the old abstract machine to the new one, thus extending the meaning (definition) of "happens before" relation; b) consolidating those converse rules into a single rule.

  6. X, T, U are some accesses in the following rules. In the following rules all accesses may happen on different memory locations and may be evaluated in different threads, unless otherwise specified.

  7. A simply inter-thread happens before B, if an access A to an atomic object happens before an access B to the same object, but which is evaluated in another thread.

  8. A subexpression A carries a dependency to a subexpression B if:

  • the value of A is used as an operand of B, unless:
    • B is an invocation of the kill_dependency macro, or
    • A is the left operand of a && or || operator, or
    • A is the left operand of a ?: operator, or
    • A is the left operand of a , operator; or

  • A writes to a memory location, B reads from it, and A is sequenced before B.

  9. Note: the "unless" exceptions, which break the "carries a dependency to" chain through operands of subexpressions, were chosen by the authors of the standard for the convenience of the programmer; no hardware limitations were implied here. Or so it looks.

  10. A is consume-ordered before B if:

  • a release writing A simply inter-thread happens before a consume reading B; or

  • an access A is sequenced before X and X is consume-ordered before B; or

  • A is consume-ordered before X and X carries a dependency to an access B.

  11. Note: an access X carries a dependency to B iff the subexpressions that X and B are evaluated because of do the same.

  12. A simply synchronizes with B, if a writing A simply inter-thread happens before a reading B.

  13. A relacq-synchronizes with B if:

  • a release writing A simply synchronizes with an acquire reading B; or

  • a release fence A is sequenced before T and T simply synchronizes with an acquire reading B; or

  • a release writing A simply synchronizes with U and U is sequenced before an acquire fence B; or

  • a release fence A is sequenced before T, T simply synchronizes with U and U is sequenced before an acquire fence B; or

  • a seqcst fence A is sequenced before T and T simply inter-thread happens before an acquire reading B; or

  • a release writing A simply inter-thread happens before U and U is sequenced before a seqcst fence B; or

  • a seqcst fence A is sequenced before T, T simply inter-thread happens before U and U is sequenced before a seqcst fence B.

  14. A is sync-ordered before B if:
  • A relacq-synchronizes with B; or

  • an access A is sequenced before X and X relacq-synchronizes with B; or

  • A relacq-synchronizes with X and X is sequenced before an access B; or

  • a seqcst access A is sequenced before a seqcst access C and C variously inter-thread happens before a seqcst access B; or

  • a seqcst access A variously inter-thread happens before a seqcst access C and C is sequenced before a seqcst access B.

  15. Note: rules n2346 (7.17.3\3, 9-11; 7.17.4\3, 4) of the old abstract machine do not prohibit reordering of a reading (writing) before (after) a "fence" with a seqcst writing (reading) in another thread; the new abstract machine proudly does that and even more, as may have been intended by the authors of the standard. I may be wrong of course, so let mathematicians correct me.

  16. A variously inter-thread happens before B if:

  • A simply inter-thread happens before B; or

  • A is consume-ordered before B; or

  • A is sync-ordered before B; or

  • A variously inter-thread happens before X and X variously inter-thread happens before B.

  17. A happens before B, if A is sequenced before B or A variously inter-thread happens before B.

  18. A is a visible writing to a memory location M with respect to a reading B from M, iff:

  • A happens before B, and

  • there is no other writing X to M such that A happens before X and X happens before B.

  19. B reads the value written by A, if A is a visible writing to a reading B.

  20. A reading from an atomic object reads the value from only one writing to that object.

  21. Note: this, together with rule 4, implies that the writing is a visible writing.

  22. Note: the declaration of an object (even without an initializer) specifies an initialization writing to that object.

  23. The implementation should not introduce a writing, which is not specified by any statement in a program, to any object declared in the program.

Multi-threaded executions and data races (new version of n4860, C++20)

Rules 1-13 are the same as the new rules 1-13 for C17 above. Rules 17-23 are the same as the new rules 17-23 for C17 above.

  14. A is sync-ordered before B if:
  • an access A is sequenced before X and X relacq-synchronizes with B; or

  • A relacq-synchronizes with X and X is sequenced before an access B; or

  • a seqcst access A is sequenced before a seqcst access C and C strongly inter-thread happens before a seqcst access B; or

  • a seqcst access A strongly inter-thread happens before a seqcst access C and C is sequenced before a seqcst access B.

  15. A strongly inter-thread happens before B if:
  • A simply inter-thread happens before B; or

  • A is sync-ordered before B; or

  • A strongly inter-thread happens before X and X strongly inter-thread happens before B.

  16. A variously inter-thread happens before B if:
  • A strongly inter-thread happens before B; or

  • A relacq-synchronizes with B; or

  • A is consume-ordered before B; or

  • A variously inter-thread happens before X and X variously inter-thread happens before B.

  • This is all very interesting but I do not see how it answers the question in any way. – Nate Eldredge Apr 20 '23 at 01:21
  • @NateEldredge I have looked more attentively, and indeed not only is the answer unclear, but I have also found a mistake: I had not specified the ability of seq_cst fences to couple with ordinary release and acquire accesses. Now I have rewritten everything... Thank you for your interest! – Eugene Zavidovsky Apr 21 '23 at 16:57
  • Well, I have more questions, but I still don't think this is a good answer to the question at hand, and so I really do not think this answer's comments are the right place to discuss further. Have you written this up and posted it anywhere else? – Nate Eldredge Apr 23 '23 at 03:22
  • @NateEldredge That is not the shortest answer, yeah... If you want to, you may write to my mdzar5___gmail. – Eugene Zavidovsky Apr 23 '23 at 16:12