
I'm trying to understand the purpose of `std::atomic_thread_fence(std::memory_order_seq_cst)` fences, and how they're different from acq_rel fences.

So far my understanding is that the only difference is that seq-cst fences affect the global order of seq-cst operations ([atomics.order]/4). And said order can only be observed if you actually perform seq-cst loads.

So I'm thinking that if I have no seq-cst loads, then I can replace all my seq-cst fences with acq-rel fences without changing the behavior. Is that correct?

And if that's correct, why am I seeing code like this "Implementation of Dekker's Algorithm with Fences" blog post, which uses seq-cst fences while keeping all atomic reads/writes relaxed? Here's the code from that post:

std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}
HolyBlackCat
  • It looks like it's merely combining thread fences for store (release) and load (acquire) into a single full memory barrier, thus ensuring that the flag0 change is committed to memory before loading flag1 from memory. I don't imagine there would be a problem with replacing a seq-cst barrier with a rel-acq barrier pair. – Den-Jason Jan 04 '22 at 10:57
  • @Den-Jason Looking at the first fence in each function, it seems neither seq-cst nor acq-rel fence would do the right thing, and the only option is to (remove the fence and) use seq-cst for the first store and the first load. – HolyBlackCat Jan 04 '22 at 12:01
  • But I'm not sure if it's a problem in the real world, or just a weakness of C++ memory model formalism. – HolyBlackCat Jan 04 '22 at 12:02
  • Just for the record https://godbolt.org/z/odjaorGcq shows that GCC compiles `std::atomic_thread_fence(std::memory_order_acq_rel);` to zero instructions for x86, rather than an mfence or equivalent, so it's not preventing StoreLoad reordering. So if your hypothesis is right, that would mean nothing (?) in a program would need to block StoreLoad reordering across acq_rel fences to satisfy any of the ISO C++ requirements that apply to programs without SC loads. – Peter Cordes Jan 04 '22 at 12:43
  • Are you counting the read part of SC RMWs as part of "SC loads"? A failed `compare_exchange_weak` with seq_cst fail order, might be relevant, or might just be equivalent to fetch_add(0), I forget. – Peter Cordes Jan 04 '22 at 12:46
  • @PeterCordes I'm very new to this, so I don't have a definitive answer. I think yes, RMWs might count. – HolyBlackCat Jan 04 '22 at 12:52
  • IIRC, Dekker's algorithm relies on blocking StoreLoad reordering so threads can read back what's actually in memory, not store-forward their own view. If that code is truly safe and guaranteed by ISO C++'s formal rules, then the fences aren't equivalent. The alternative explanation is that it just happens to work when compiled to asm for real ISAs (by compilers that don't optimize atomics, and thus will emit a fence instruction for a `seq_cst` C++ fence). I forget the details of how ISO C++ rules connect SC fences to "happens-before"s that govern what other threads are allowed to observe :/ – Peter Cordes Jan 04 '22 at 12:57
  • @PeterCordes I think it's the latter, it just happens to work. But I wonder if I'm approaching this from the wrong side, trying to learn the standard's abstraction first, and not what it actually compiles to. – HolyBlackCat Jan 04 '22 at 13:07
  • From my understanding, for the purposes of "happens before", acq-rel and seq-cst operations are equivalent. Seq-cst only adds extra constraints on what seq-cst loads are allowed to see and in what order. – HolyBlackCat Jan 04 '22 at 13:08
  • I definitely found it helpful to understand HW memory models like x86's that are defined in terms of accesses to a shared coherency domain that truly exists whether any readers are looking, with reordering allowed or not. e.g. https://preshing.com/20120913/acquire-and-release-semantics/ (And the fact that some ISAs, like PowerPC, allow private forwarding of data between cores, creating [IRIW reordering](https://stackoverflow.com/a/50679223)). Then from there you can see how the C++ formalism is put together to give equivalent guarantees without reference to such details. – Peter Cordes Jan 04 '22 at 13:16
  • (Also see https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/.) One pitfall of that approach is that it took me a couple years to realize that a seq_cst store operation (not fence) doesn't *have* to be a full memory barrier on the spot, or to prevent reordering with later relaxed ops; ARM64 STLR's special interaction to not reorder with LDAR makes its asm `seq_cst` only just as strong as ISO C++ requires, not a lot stronger like most other ISAs, thus more efficient. See [The strong-ness of x86 store instruction wrt. SC-DRF?](https://stackoverflow.com/a/70252123) – Peter Cordes Jan 04 '22 at 13:19
  • Thanks, I'll take a look. – HolyBlackCat Jan 04 '22 at 13:22
  • So it appears the difference between seq-cst and acq-rel is simply that seq-cst is "global" when working with atomic variables. When it comes to using thread fences, surely acq-rel would effectively be global, since it's not concerned with any specific variables? Or is the compiler free to look at which variables are used around the fence and optimise accordingly? – Den-Jason Jan 04 '22 at 22:50
  • @Den-Jason At least from the standard's point of view, an acq-rel fence effectively elevates any preceding atomic reads (most notably relaxed ones) to acquires, and any following atomic writes to releases. So at least theoretically, it's not global. – HolyBlackCat Jan 04 '22 at 22:53
  • @Den-Jason: A fence separates *all* possible future accesses before/after it in the same thread, not just in this function, so being non-global and instead promoting certain ops to acquire or release is only a very theoretical possibility for any real code that involves branches or loops. – Peter Cordes Jan 05 '22 at 14:12

3 Answers


Before C++20, the C++ standard contained the following paragraphs to define visibility in terms of the single total order S (see [atomics.order]):

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

However, WG21/P0668R5 proposed two changes that were accepted for C++20: first, it weakens some guarantees of the single total order S (it no longer needs to be consistent with happens-before), and second (more importantly for this question), it strengthens the guarantees provided by seq-cst fences. To that end, it replaced Section 32.4 [atomics.order] paragraphs 3-7 (the definition of SC ordering) with the wording referenced in your question. P0668R5 states:

This new wording takes a significantly different approach with respect to the sequential consistency ordering S: Instead of defining visibility in terms of S and the other orders in the standard, this essentially defines constraints on S in terms of visibility in a particular execution, as expressed by the coherence order. If these constraints are not satisfiable by any total order S, then the candidate execution which gave rise to the coherence order is not valid.

To be honest I find this new definition to be less clear, but I suppose I just need to read through it a few more times and try to wrap my head around it. :) However, since this is supposed to strengthen the guarantees, the visibility guarantees described in previous standards should still hold.

Seq-cst fences are also part of the single total order S, so any two seq-cst fences are always totally ordered. Suppose we have a write operation A that is sequenced before a seq-cst fence X, and a read operation B that is sequenced after a seq-cst fence Y. If X is ordered before Y in S, then B is guaranteed to observe the value written by A, or some later value (this is based on the last paragraph from the first quote). An acq-rel fence does not provide any such guarantees.

Acq-rel fences essentially convert the surrounding relaxed loads/stores into acquire/release operations, but they only provide visibility guarantees for the surrounding operations once an acquire-load has synchronized with a release-store, i.e., once the load has already observed the stored value. They provide no guarantee about *when* the store becomes visible.

mpoeter
  • I've seen this proposal when dealing with [this question](https://stackoverflow.com/q/70554277/2752075), but apparently I stopped reading before they got to fences. :) As for the new definition, try applying it to the example in Nate's answer (and check out the comments). – HolyBlackCat Jan 06 '22 at 18:33
  • So.... presumably an acq-rel fence is equivalent to an acquire-fence immediately followed by a release-fence? But a release-fence immediately followed by an acquire-fence would behave differently; equivalent to a seq-cst fence? – Den-Jason Jan 07 '22 at 19:21
  • @Den-Jason No, by combining acquire- and release-fences you can only get the same guarantees as an acq-rel-fence. Why do you think a release-fence immediately followed by an acquire-fence would behave differently? – mpoeter Jan 08 '22 at 18:29
  • @mpoeter because you'd be ensuring the stores are committed to memory before ensuring the loads are read through. Which is presumably a better scenario than the other way round. – Den-Jason Jan 09 '22 at 16:22
  • I think there needs to be some guide written somewhere which identifies intended scenarios for the different load/store/fence types, since the existing documentation is clearly not lucid enough. – Den-Jason Jan 09 '22 at 16:25
  • @Den-Jason basically an acquire-fence converts all atomic reads that are sequenced _after_ the fence into acquire operations. Likewise, the release-fence converts all atomic writes that are sequenced _before_ the fence into release operations. See [\[atomic.fences\]](http://eel.is/c++draft/atomics.fences). – mpoeter Jan 09 '22 at 18:11
  • It can be helpful to experiment a bit and see how these things are translated to specific assembly code to get a better understanding. However, in general I recommend to _avoid_ trying to reason about such translations, and instead focus on the definitions from the C++ standard. (Also see my answer here https://stackoverflow.com/questions/64382673/is-there-a-happens-before-relationship-between-the-use-of-a-stack-local-object-a/64393987?noredirect=1#comment124621648_64393987) – mpoeter Jan 09 '22 at 18:15

As I understand it, they're not the same, and a counterexample is below. I believe the error in your logic is here:

And said order can only be observed if you actually perform seq-cst loads.

I don't think that's true. In [atomics.order] p4, which defines the axioms of the sequential consistency total order S, items 2-4 may all involve operations which are not seq_cst. You can observe the coherence ordering between such operations, and this can let you infer how the seq_cst operations have been ordered.

As an example, consider the following version of the StoreLoad litmus test, akin to Peterson's algorithm:

std::atomic<bool> a,b;  // initialized to false

void thr1() {
    a.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (b.load(std::memory_order_relaxed) == false)
        std::cout << "thr1 wins";
}

void thr2() {
    b.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (a.load(std::memory_order_relaxed) == false)
        std::cout << "thr2 wins";
}

Note all the loads are relaxed.

I claim that if thr1 prints "thr1 wins", then we deduce that a.store(true) preceded b.store(true) in the sequential consistency order S.

To see this, let A be b.load() and B be b.store(true). If b.load() == false then we have that A is coherence-ordered before B. (Apply atomics.order p3.3 with A,B as above, and X the initialization of b to false.)

Now let X be the fence in thr1. Then X happens before A by sequencing, so X precedes B in S by atomics.order p4.3. That is, the thr1 fence precedes b.store(true). And a.store(true), which is also seq_cst, precedes the thr1 fence, because the store strongly happens before the fence by sequencing. So by transitivity, a.store(true) precedes b.store(true), as claimed.

Similarly, if thr2 prints, then b.store(true) precedes a.store(true). They can't both precede each other, so we have therefore proved that the program cannot print both messages.

If you downgrade the fences to acq_rel, the proof breaks down. In that case, as far as I can see, nothing prevents the program from printing thr1 wins even if b.store(true) precedes a.store(true) in the order S. As such, with acq_rel fences, I believe it is allowed for both threads to print. Though I'm not sure whether there is any real implementation where it could actually happen.


We can get an even simpler example if we make all the loads and stores relaxed, so that the only seq_cst operations are the fences. Then we can use (4.4) instead to show that if b.load(relaxed) returns false, the thr1 fence precedes the thr2 fence, and vice versa if a.load() returns false. We therefore conclude, as before, that the program cannot print both messages.

However, if we keep the loads and stores relaxed and weaken the fences to acq_rel, it is then more clear that we have lost this guarantee. Indeed, with a little prodding, a similar example actually fails on x86, where the acq_rel fence is a no-op because the loads/stores are already acquire/release. So that's a clearer case where a seq_cst fence really achieves something that acq_rel does not.

Nate Eldredge
  • Ok, I agree that if T1 prints `thr1 wins`, then the seq-cst order is `a = true`, T1 fence, `b = true`, T2 fence. But how does that order stop `a.load(relaxed)` in T2 from returning the old value of `a` before the assignment? – HolyBlackCat Jan 05 '22 at 16:04
  • @HolyBlackCat: Because if `a.load(relaxed)` in T2 returned `false`, the same logic would tell us that the `seq_cst` order is instead `b = true`, T2 fence, `a = true`, T1 fence. It can't be both. – Nate Eldredge Jan 05 '22 at 16:06
  • @HolyBlackCat: Or to say it another way, if the seq-cst order is `a=true`, T1 fence, `b=true`, T2 fence, then the contrapositive of (4.3) says that `a.load(relaxed)` is not coherence-ordered before `a=true`. So by (3.3), `a.load` cannot read the value stored by the initialization X. The only other value it can read is the one stored by `a=true`, namely `true`. – Nate Eldredge Jan 05 '22 at 16:11
  • An acq_rel fence would definitely not be sufficient here; look at how it compiles for a real ISA like x86, where `fence(acq_rel)` compiles to zero asm instructions, not preventing StoreLoad reordering. The only question that doesn't answer is whether the seq_cst fence preventing StoreLoad reordering on x86 is an implementation detail of it compiling to `mfence`, or if that's something the ISO standard actually guarantees. (I think the latter, although I didn't double-check your reasoning based on the standard. That sounds familiar from last I looked at the rules of how fences connect.) – Peter Cordes Jan 06 '22 at 09:09

To summarize the argument made by Nate Eldredge in their answer:


The global seq-cst order by itself doesn't seem to affect anything other than seq-cst loads, but the order must exist even if there are no loads.

Certain operation reorderings (like, in Nate's example, both threads entering the critical section) would impose contradicting constraints on the seq-cst order, and are therefore not allowed.

I'm not entirely sure whether this effect can be observed without seq-cst fences (i.e. with only seq-cst stores); I think not.

HolyBlackCat
  • Interesting to note that without seq_cst fences or seq_cst loads, ARMv8.3 compiler output would only involve `stlr` release stores, but nothing that would ever stop it from reordering with later loads. (An `ldar` seq_cst load, or a full barrier instruction). ARMv8 seq_cst is generally described as only being as strong as ISO C++ seq_cst requires, not stronger / more expensive, and with ARMv8.3 `ldapr`, acq_rel is similarly just enough. (Although unlike PowerPC, it's guaranteed not to do IRIW reordering: different readers will never disagree on the order of stores done by two other cores.) – Peter Cordes Jan 06 '22 at 09:59