
Intel x86/x86_64 systems have three types of memory barriers: `LFENCE`, `SFENCE` and `MFENCE`. The question is how to use them. For sequential consistency (SC) it is sufficient to use `MOV [addr], reg` + `MFENCE` for all memory cells requiring SC semantics. However, you can also write it the other way around: `MFENCE` + `MOV reg, [addr]`. Apparently it was felt that, since the number of stores to memory is usually smaller than the number of loads from it, putting the barrier on the write side is cheaper in total. On the basis that SC stores must be made sequential, another optimization was made: `[LOCK] XCHG`, which is probably cheaper because the "MFENCE inside XCHG" applies only to the cache line used by the XCHG (see the video, where at 0:28:20 it is said that MFENCE is more expensive than XCHG).

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

C/C++11 operation → x86 implementation:

  • Load Seq_Cst: MOV (from memory)
  • Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:

  • Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
  • Store Seq_Cst: MOV (into memory)
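A rough C++11 sketch of the two mappings, using relaxed accesses plus explicit fences to show where the full barrier lands in each scheme. This only mirrors the x86 asm placement; at the C++ level a relaxed store plus a seq_cst fence is not strictly equivalent to a seq_cst store, so treat it as an illustration, not a drop-in replacement (the `sc_store_*`/`sc_load_*` names are made up here):

```cpp
#include <atomic>

std::atomic<int> cell{0};

// Mapping 1: the full barrier is attached to the store.
void sc_store_v1(int v) {
    cell.store(v, std::memory_order_relaxed);             // plain mov into memory
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence (or the pair becomes xchg)
}
int sc_load_v1() {
    return cell.load(std::memory_order_relaxed);          // plain mov from memory
}

// Mapping 2 (the alternative): the full barrier is attached to the load instead.
void sc_store_v2(int v) {
    cell.store(v, std::memory_order_relaxed);             // plain mov into memory
}
int sc_load_v2() {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence (or lock xadd with 0)
    return cell.load(std::memory_order_relaxed);
}
```

All seq_cst operations in one program must use the same mapping: mixing a mapping-1 store with a mapping-2 load leaves no full barrier between the store and a later load.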

The difference is that ARM and Power memory barriers interact exclusively with the LLC (Last Level Cache), whereas x86 barriers also interact with the lower-level caches L1/L2. On x86/x86_64:

  • lfence on Core1: (CoreX-L1) -> (CoreX-L2) -> L3 -> (Core1-L2) -> (Core1-L1)
  • sfence on Core1: (Core1-L1) -> (Core1-L2) -> L3 -> (CoreX-L2) -> (CoreX-L1)

In ARM:

  • ldr; dmb;: L3 -> (Core1-L2) -> (Core1-L1)
  • dmb; str; dmb;: (Core1-L1) -> (Core1-L2) -> L3

C++11 code compiled by GCC 4.8.2, disassembled in GDB on x86_64:

std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8  <+0x0058>         mov    0x38(%rsp),%eax
0x4613ec  <+0x005c>         mov    %eax,0x20(%rsp)
0x4613f0  <+0x0060>         mfence

But why does x86/x86_64 implement sequential consistency (SC) with `MOV [addr], reg` + `MFENCE` rather than `MOV [addr], reg` + `SFENCE`? Why is the full-fence `MFENCE` needed there instead of `SFENCE`?

Alex
  • I think a store fence would only synchronize with other loads, not with other stores. Sequential consistency means that you want a *total* order that's observed by all CPUs, and a store fence wouldn't imply an ordering of multiple stores. – Kerrek SB Sep 25 '13 at 10:21
  • @Kerrek This is true for ARM, but not for x86: if we issue SFENCE on the first CPU core, then we no longer have to issue LFENCE on the other CPU core before accessing this memory cell. Accordingly, if all the variables require sequential semantics (SC), we do SFENCE and never need LFENCE anywhere. Or do you mean that MFENCE cancels reordering (out-of-order execution) in both directions in the processor pipeline? – Alex Sep 25 '13 at 10:41
  • First and foremost I think I want to say that sfence alone cannot provide a *total* ordering that's observed by all CPUs... – Kerrek SB Sep 25 '13 at 11:07
  • @Kerrek SB Sequential semantics and a total ordering observed by all CPUs are synonyms. But the question is why `SFENCE` after each store operation cannot provide a total ordering observed by all CPUs, i.e. why we need the `LFENCE` contained in `MFENCE` after each store operation (**not before the load operation**)? – Alex Sep 25 '13 at 16:46
  • So, I think the following could happen. Suppose `X` and `Y` are zero. Now: `[Thread 1: STORE X = 1, SFENCE]`, `[Thread 2: STORE Y = 1, SFENCE]`, and in any other thread, do `[LFENCE, LOAD X, LOAD Y]`. Now one other thread could see `X = 1, Y = 0`, and another could see `X = 0, Y = 1`. The fences only tell you that *other, earlier* stores in Thread 1 have taken effect *if* you see `X = 1`. But there's no global order consistent with that. – Kerrek SB Sep 25 '13 at 22:25
  • @Kerrek SB OK. But will this case have sequential semantics? Suppose `X` and `Y` are zero. Now: `[Thread 1: STORE X = 1, MFENCE(LFENCE, SFENCE)]`, `[Thread 2: STORE Y = 1, MFENCE(LFENCE, SFENCE)]`, and in any other thread, do `[LOAD X, LOAD Y]`. Could one other thread now see `X = 1, Y = 0`, and another see `X = 0, Y = 1`? (**NOTICE:** STORE + MFENCE(L/S) and LOAD (without any fences) are what GCC 4.8.2 uses for std::memory_order_seq_cst) – Alex Sep 26 '13 at 09:45
  • With sequential consistency (implemented with mfences), you can imagine the events occurring in some sequential, global order, which is observed by everyone. So if you see `X = 1, Y = 0` somewhere, that means in your serialization the `STORE X = 1` comes first, so you cannot observe `X = 0, Y = 1` elsewhere. – Kerrek SB Sep 26 '13 at 10:14
  • @Kerrek SB I know all that :) But that does not answer my question. If for sequential consistency we can write STORE + LFENCE+SFENCE and LOAD (without a fence), then how does the **LFENCE after the STORE** help with SC (sequential consistency)? LFENCE can only make sense before a LOAD! – Alex Sep 26 '13 at 10:41
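The scenario Kerrek SB describes in these comments is the classic IRIW (independent reads of independent writes) litmus test. A sketch of it with C++11 atomics, using `seq_cst` everywhere as the compiler mapping above would emit it (helper names are invented here; note a test run can only fail to falsify SC, not prove it):

```cpp
#include <atomic>
#include <thread>

// IRIW litmus test: two writers, two readers scanning in opposite order.
// With seq_cst everywhere, both readers must agree on one global order of
// the stores to X and Y, so "reader1 sees X=1,Y=0 while reader2 sees
// Y=1,X=0" is forbidden.
std::atomic<int> X{0}, Y{0};

struct Result { int r1, r2, r3, r4; };

Result run_once() {
    X.store(0); Y.store(0);
    Result res{};
    std::thread t1([] { X.store(1, std::memory_order_seq_cst); });
    std::thread t2([] { Y.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] { res.r1 = X.load(std::memory_order_seq_cst);
                         res.r2 = Y.load(std::memory_order_seq_cst); });
    std::thread t4([&] { res.r3 = Y.load(std::memory_order_seq_cst);
                         res.r4 = X.load(std::memory_order_seq_cst); });
    t1.join(); t2.join(); t3.join(); t4.join();
    return res;
}

// True if the non-SC outcome (readers disagreeing on the store order)
// was ever observed.
bool saw_forbidden(int iters) {
    for (int i = 0; i < iters; ++i) {
        Result r = run_once();
        if (r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0)
            return true;
    }
    return false;
}
```

With only SFENCE after the stores and no barrier on the reader side, nothing forces the two readers to agree on a single interleaving, which is exactly Kerrek SB's point.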

2 Answers


sfence doesn't block StoreLoad reordering. Unless there are any NT stores in flight, it's architecturally a no-op. Stores already wait for older stores to commit before they themselves commit to L1d and become globally visible, because x86 doesn't allow StoreStore reordering. (Except for NT stores / stores to WC memory)

For seq_cst you need a full barrier to flush the store buffer / make sure all older stores are globally visible before any later loads. See https://preshing.com/20120515/memory-reordering-caught-in-the-act/ for an example where failing to use mfence in practice leads to non-sequentially-consistent behaviour, i.e. memory reordering.
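Preshing's experiment is the store-buffer (Dekker-style) litmus test; here is a C++11 sketch of it (helper names invented here). With `seq_cst` the full barrier drains the store buffer before the following load, so both threads reading 0 is forbidden; with weaker orders, or a plain `mov` + `sfence`, real hardware can and does produce that outcome:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> flagA{0}, flagB{0};

// One round of the store-buffer litmus test: each thread stores its own
// flag, then loads the other's. seq_cst forbids rA == 0 && rB == 0,
// because the full barrier makes each store globally visible before the
// thread's subsequent load executes.
void run_round(int &rA, int &rB) {
    flagA.store(0); flagB.store(0);
    std::thread t1([&] {
        flagA.store(1, std::memory_order_seq_cst);
        rA = flagB.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        flagB.store(1, std::memory_order_seq_cst);
        rB = flagA.load(std::memory_order_seq_cst);
    });
    t1.join(); t2.join();
}

// True if StoreLoad reordering was ever observed (it must not be, with seq_cst).
bool both_zero_ever(int iters) {
    for (int i = 0; i < iters; ++i) {
        int rA = -1, rB = -1;
        run_round(rA, rB);
        if (rA == 0 && rB == 0) return true;
    }
    return false;
}
```

Downgrading the operations to `memory_order_release` / `memory_order_relaxed` (or compiling the stores to `mov` + `sfence`) removes the StoreLoad barrier, and the "both zero" result shows up in practice, as in Preshing's article.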


As you found, it is possible to map seq_cst to x86 asm with full barriers on every seq_cst load instead of on every seq_cst store / RMW. In that case you wouldn't need any barrier instructions on stores (so they'd have release semantics), but you'd need mfence before every atomic::load(seq_cst).

Peter Cordes

You don't need an mfence; sfence does indeed suffice. In fact, you never need lfence on x86 unless you are dealing with a device. But Intel (and I think AMD) has (or at least had) a single implementation shared between mfence and sfence (namely, flushing the store buffer), so there was no performance advantage to using the weaker sfence.

BTW, note that you don't have to flush after every write to a shared variable; you only have to flush between a write and a subsequent read of a different shared variable.

user2949652
  • Thanks! But I don't agree that we "never need lfence in x86". You can see my additional question about this, and where we can use it: "3. MFENCE+LOAD and STORE(without fence)" http://stackoverflow.com/q/19047327/1558037 I don't put any fences anywhere myself, but the C/C++ compiler does it for each std::memory_order_seq_cst (sequentially consistent) variable – Alex Nov 05 '13 at 13:00
  • SFENCE does *not* give you seq-cst on x86 in general. It might on AMD where IIRC it's as strong as MFENCE. As you can see from the eventual answers to Alex's linked question, you *do* need a full barrier because you can't build mfence out of SFENCE+LFENCE. (As you say, you only need LFENCE after SSE4.1 weakly-ordered loads from WC memory, so it's basically never useful for memory ordering, only for its execution barrier effect.) – Peter Cordes Mar 19 '19 at 01:06