The strong-ness of x86 store instruction wrt. SC-DRF?

Question

I read about Herb's atomic<> Weapons talk and had a question about page 42:

He mentioned that (50:00 in the video):

(x86) stores are much stronger than they need to be...

What I don't understand is: if the x86 "S" on the chart is a plain store, i.e. mov, I don't think it's stronger than SC-DRF, because it's only a release store plus total store order (and that's why you need an xchg for an SC store). But if it means an SC store, i.e. xchg, it should fall on the "fully SC" bar because it's effectively a full barrier. How should I interpret the strong-ness of this x86 "S" on the chart?

(SC-DRF is a guarantee of Sequentially Consistent execution for Data Race Free programs, as long as they don't use any atomics with orders weaker than std::memory_order_seq_cst. ISO C++ and Java, and other languages, provide this.)

@ThomasMatthews it stands for "Sequential Consistency for Data-Race-Free program". You can watch Herb's "atomic<> Weapons" talk for more information. — zanmato, Dec 06 '21 at 17:57

Peter Cordes · Accepted Answer · 2021-12-06T21:33:46.750

Yes, he's showing xchg there (full barrier and an RMW operation), not just a mov store - a plain mov would be below the SC-DRF bar because it doesn't provide sequential consistency on its own without mfence or other barrier.
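To make the distinction concrete, here is a minimal sketch of the two kinds of store being compared. The function and variable names are hypothetical, and the instructions named in the comments are what mainstream compilers typically emit for x86-64, not something the C++ standard requires.

```cpp
#include <atomic>

std::atomic<int> flag{0};

// Release store: on x86-64 this is typically just a plain `mov`.
// Not enough for seq_cst on its own, because a later load could
// still pass it while the store sits in the store buffer.
void store_release(int v) {
    flag.store(v, std::memory_order_release);
}

// seq_cst store: on x86-64 this is typically `xchg`
// (some compilers have used `mov` + `mfence` instead).
// Either way it acts as a full barrier.
void store_seq_cst(int v) {
    flag.store(v, std::memory_order_seq_cst);
}

// seq_cst load: on x86-64 this is typically a plain `mov`.
int load_seq_cst() {
    return flag.load(std::memory_order_seq_cst);
}
```

Comparing the compiler output for `store_release` and `store_seq_cst` on x86-64 is a quick way to see which instruction the "S" on the chart refers to.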

Compare ARM64 stlr / ldar - they can't reorder with each other (not even StoreLoad), but stlr can reorder with other later operations, except of course other release-store operations, or some fences. (Like I mentioned in answer to your previous question). See also Does STLR(B) provide sequential consistency on ARM64? re: interaction with ldar for SC vs. ldapr for just acquire / release or acq_rel. Also Possible orderings with memory_order_seq_cst and memory_order_release for another example of how AArch64 compiles (without ARMv8.3 LDAPR).

But x86 seq_cst stores drain the store buffer on the spot, even if there is no later seq_cst load, store, or RMW in the same thread. This lack of reordering with later non-SC or non-atomic loads/stores is what makes it stronger (and more expensive) than necessary.
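A small sketch of the case where that extra strength shows up: a seq_cst store followed by a relaxed load of a different variable. The function name `publish_then_peek` is made up for illustration, and the per-ISA notes in the comments assume the usual compiler mappings (xchg on x86-64, stlr / ldr on AArch64).

```cpp
#include <atomic>

// Hypothetical helper: seq_cst store to x, then a relaxed load of y.
int publish_then_peek(std::atomic<int>& x, const std::atomic<int>& y) {
    // x86-64: typically `xchg`, a full barrier. The store buffer is
    // drained before any later load can execute.
    // AArch64: typically `stlr`, which only has to stay ordered before
    // a later `ldar` (seq_cst load), not before plain loads.
    x.store(1, std::memory_order_seq_cst);

    // C++ allows this relaxed load to become visible before the store
    // above. AArch64 (plain `ldr`) can take advantage of that;
    // x86 with `xchg` cannot, which is the "overkill" part.
    return y.load(std::memory_order_relaxed);
}
```

SC-DRF only needs the store ordered with respect to other seq_cst operations, so blocking the reordering in the relaxed case buys nothing for correctness.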

Herb Sutter explained this earlier in the video, at around 36:00. He points out that xchg is stronger than necessary, not just an SC-release that can one-way reorder with later non-SC operations: "So what we have here, is overkill. Much stronger than is necessary" (at 36:30).

(Side note: right around 36:00, he mis-spoke: he said "we're not going to use these first 3 guarantees" (that x86 doesn't reorder loads with loads or stores with stores, or stores with older loads). But those guarantees are why SC load can be just a plain mov. Same for acq/rel being just plain mov for both load and store. That's why as he says, lfence and sfence are irrelevant for std::atomic.)
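As a rough illustration of that side note, here is the usual release / acquire message-passing pattern. The names `payload`, `ready`, `produce` and `try_consume` are hypothetical; the comments describe typical x86-64 code generation, where no fence instruction is needed for either side.

```cpp
#include <atomic>

int payload = 0;                 // ordinary non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the data

// Writer side: fill in the data, then publish it.
void produce(int v) {
    payload = v;                                   // plain store
    // x86-64: typically a plain `mov`. The hardware already keeps
    // stores in order and never moves a store ahead of older loads.
    ready.store(true, std::memory_order_release);
}

// Reader side: returns true and fills `out` once the data is published.
bool try_consume(int& out) {
    // x86-64: typically a plain `mov`. The hardware already keeps
    // loads in order with respect to each other.
    if (ready.load(std::memory_order_acquire)) {
        out = payload;  // guaranteed to see the writer's value
        return true;
    }
    return false;
}
```

No lfence or sfence appears anywhere in the generated code for this pattern, which matches the point that those instructions are irrelevant for std::atomic.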

So anyway, ARM64 can hit the sweet spot with no extra barrier instructions, being exactly strong enough for seq_cst but no stronger. (ARMv8.3 with ldapr is slightly stronger than acq_rel requires: e.g. ARM64 still forbids IRIW reordering, but only a few machines can actually produce that reordering in practice, notably POWER.)

Other ISAs with both L and S below the bar need extra barriers as part of their seq_cst load and seq_cst store recipes (https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).

Thanks for the explanation. I think I get the "they are stronger than they need to be" part. My last confusion is that if we were talking about `xchg`, wouldn't it be on the very top, i.e., fully SC, of the chart? — zanmato, Dec 07 '21 at 03:27
@zanmato: Yeah, I don't know why x86 `xchg` wouldn't be right up there with the fence at the top, fully SC. Good point. It's exactly as strong as what most C++ implementations use for `atomic_thread_fence(seq_cst)`, a `lock addl $0, (%rsp)`. ([xchg is equivalent to mfence for everything except weakly-ordered movntdqa loads from WC memory on some microarchitectures](//stackoverflow.com/q/40409297). But std::atomic leaves it up to the programmer to manually sfence or mfence after using weakly-ordered stores, and many implementations do use a locked op, not mfence, for atomic_thread_fence.) — Peter Cordes, Dec 07 '21 at 03:38
