Intel x86/x86_64 systems have 3 types of memory barriers: lfence, sfence and mfence. The question concerns how they should be used.
For sequential consistency (SC) it is sufficient to use MOV [addr], reg + MFENCE
for all memory locations that require SC semantics. However, you could also write the code the other way around: MFENCE + MOV reg, [addr]
. Apparently it was felt that, since stores to memory are usually fewer than loads from it, putting the barrier on the write side costs less in total. On that basis, for sequential stores to memory, a further optimization was made: [LOCK] XCHG, which is probably cheaper due to the fact that the implicit "MFENCE inside XCHG" applies only to the cache line used by the XCHG (a video where at 0:28:20 it is said that MFENCE is more expensive than XCHG).
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
C/C++11 Operation -> x86 implementation:
- Load Seq_Cst: MOV (from memory)
- Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE

Note: there is an alternative mapping of C/C++11 to x86 which, instead of locking (or fencing) the Seq_Cst store, locks/fences the Seq_Cst load:
- Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
- Store Seq_Cst: MOV (into memory)
The difference is that ARM and Power memory barriers interact exclusively with the LLC (Last Level Cache), whereas x86 barriers also interact with the lower-level caches L1/L2. On x86/x86_64:
lfence on Core1: (CoreX-L1) -> (CoreX-L2) -> L3 -> (Core1-L2) -> (Core1-L1)
sfence on Core1: (Core1-L1) -> (Core1-L2) -> L3 -> (CoreX-L2) -> (CoreX-L1)
In ARM:
ldr; dmb;      : L3 -> (Core1-L2) -> (Core1-L1)
dmb; str; dmb; : (Core1-L1) -> (Core1-L2) -> L3
C++11 code compiled by GCC 4.8.2, disassembled with GDB on x86_64:
std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8 <+0x0058> mov 0x38(%rsp),%eax
0x4613ec <+0x005c> mov %eax,0x20(%rsp)
0x4613f0 <+0x0060> mfence
But why does x86/x86_64 implement sequential consistency (SC) via MOV [addr], reg + MFENCE
rather than MOV [addr], reg + SFENCE
? Why is the full fence MFENCE
needed there instead of SFENCE
?