4

What is the difference in logic and performance between the x86 instructions LOCK XCHG and MOV+MFENCE for doing a sequential-consistency store?

(We ignore the load result of the XCHG; compilers other than gcc use XCHG this way for the store + memory-barrier effect.)

Is it true that, for sequential consistency, during the execution of an atomic operation LOCK XCHG locks only a single cache line, whereas MOV+MFENCE locks the whole L3 cache (LLC)?
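
For concreteness, here is a minimal C++11 sketch of the store being asked about (the variable and function names are placeholders, not taken from the question); which of the two instruction sequences the compiler emits for it is exactly the MOV+MFENCE vs. XCHG choice in question. GCC 4.8 emits `mov` + `mfence`, other compilers emit `xchg` and discard its load result.

```cpp
// Minimal sketch of a sequential-consistency store in C++11.
// On x86-64 this store compiles either to "mov [x], reg; mfence"
// (GCC 4.8) or to "xchg [x], reg" with the load result ignored
// (other compilers); both are valid seq_cst stores.
#include <atomic>

std::atomic<int> x{0};

void seq_cst_store(int v) {
    x.store(v, std::memory_order_seq_cst);
}

int main() {
    seq_cst_store(42);
    return 0;
}
```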

Peter Cordes
Alex
  • 2
    Apples and oranges, MFENCE doesn't provide atomicity. – Hans Passant Sep 30 '13 at 14:10
  • 2
    @Hans Passant I didn't say that MFENCE provides atomicity, because MOV is already atomic; we can see this in C11(`atomic`)/C++11(`std::atomic`) for every ordering on x86 except SC (sequential consistency): http://en.cppreference.com/w/cpp/atomic/memory_order But I did say that **MFENCE provides sequential consistency** for atomic variables, as we can see in C11(`atomic`)/C++11(`std::atomic`) in GCC 4.8.2: http://stackoverflow.com/questions/19047327/why-gcc-does-not-use-loadwithout-fence-and-storesfence-for-stdmemory-order – Alex Sep 30 '13 at 14:37
  • `mov` may be atomic for what it does, but `xchg` can't be expressed as a single `mov`. – Kerrek SB Sep 30 '13 at 14:49
  • 1
    (I'm not even sure if `mov` is atomic for unaligned access, by the way.) – Kerrek SB Sep 30 '13 at 14:57
  • 2
    @Kerrek SB We can replace `MOV+MFENCE` (SC in GCC 4.8.2) with `LOCK XCHG` for SC, as we can see in the video, where at **0:28:20** it is said that MFENCE is more expensive than XCHG: http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-2-of-2 – Alex Sep 30 '13 at 15:18
  • 1
    @Alex, see also here - http://stackoverflow.com/questions/19059542/how-do-fences-atomize-load-modify-store-operations/19060548#19060548 – Leeor Sep 30 '13 at 17:46
  • I thought LOCK was implicit with XCHG? Does specifying LOCK XCHG actually do anything different than just an XCHG? – Brian Knoblauch Oct 01 '13 at 11:31
  • 1
    @BrianKnoblauch: Yes, `lock` is already implicit for `xchg [mem], reg`. Hopefully when people say LOCK XCHG, they're just talking about the implied behaviour. I'm not sure if any assemblers will omit the `lock` prefix from the machine code if you write `lock xchg`, but they could. – Peter Cordes Oct 20 '18 at 22:40
  • @KerrekSB: This question is asking about 2 methods for doing a seq_cst store, where we ignore the load result of the `xchg` and just use it to do a store + memory barrier. Turns out it's more efficient to use `xchg` on Intel Skylake at least, where `mfence` blocks out-of-order exec of independent non-memory instructions. I'm closing this as a dup for now because I addressed this in an answer on a related question, but maybe this question deserves its own answer. [Which is a better write barrier on x86: lock+addl or xchgl?](https://stackoverflow.com/a/52910647) is related. – Peter Cordes Oct 20 '18 at 22:42
  • @PeterCordes: Sure, makes sense, thanks. – Kerrek SB Oct 21 '18 at 10:38
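
For reference, here is a hedged sketch (GNU inline asm, x86-64 assumed; the function names are invented for this sketch) of the two instruction sequences compared in the comments above. The `xchg` form relies on the implicit `lock` mentioned in the comments and simply ignores the value it loads:

```cpp
// Sketch only: the two x86-64 sequences for a seq_cst store,
// written as GNU inline asm so the instructions are visible.
#include <cstdint>

void store_mov_mfence(volatile std::int64_t* p, std::int64_t v) {
    // plain store, then a full memory barrier
    asm volatile("movq %1, %0\n\tmfence"
                 : "=m"(*p) : "r"(v) : "memory");
}

void store_xchg(volatile std::int64_t* p, std::int64_t v) {
    // xchg with a memory operand is implicitly locked and acts as a
    // full barrier; the loaded old value left in %1 is simply ignored.
    asm volatile("xchgq %1, %0"
                 : "+m"(*p), "+r"(v) : : "memory");
}
```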

1 Answer

-1

The difference is in the purpose of usage.

MFENCE (or SFENCE or LFENCE) is useful when we are locking a memory region that is accessible from two or more threads. When we atomically take the lock for this memory region, we can then use ordinary non-atomic instructions, because they are faster. But we must issue SFENCE (or MFENCE) one instruction before unlocking the memory region, to ensure that the locked memory is visible correctly to all other threads.

If we are changing only a single aligned variable in memory, then we use atomic instructions like LOCK XCHG, so no locking of a memory region is needed.
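
As a hedged illustration of the two cases above (the names `region_lock`, `shared_region`, and `update_region` are invented for this sketch), the pattern might look like this in C++11. Note that, as the comments below point out, the locked exchange used to take the lock is itself a full barrier on x86, and the release store that unlocks needs no separate fence instruction:

```cpp
// Sketch, not a definitive implementation: a spin lock taken with an
// atomic exchange (a locked xchg on x86) protects a region that is
// then updated with ordinary non-atomic stores.
#include <atomic>

std::atomic<bool> region_lock{false};
int shared_region[1024];                 // plain, non-atomic data

void update_region() {
    while (region_lock.exchange(true, std::memory_order_acquire))
        ;                                // spin until we own the region
    for (int i = 0; i < 1024; ++i)
        shared_region[i] = i;            // fast non-atomic stores
    // unlock; the release ordering publishes the writes above
    region_lock.store(false, std::memory_order_release);
}
```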

Antoine
GJ.
  • Do you mean that if we want sequential consistency for a large area of memory (8 bytes - 1 MB and more), MFENCE gives the best performance, and if we want sequential consistency for a small area of memory, such as a single variable (1 byte (char) - 8 bytes (long long)), then LOCK XCHG is better? Because LOCK locks only a single cache line, but MFENCE locks the whole L3 cache (LLC). – Alex Sep 30 '13 at 16:48
  • @Alex: Yes, MFENCE only ensures that loads from memory and stores to memory are guaranteed to be visible to all threads correctly after execution of that instruction. MFENCE has nothing in common with atomic instructions. – GJ. Sep 30 '13 at 17:21
  • 4
    No, an x86 lock is in itself an mfence (it's even said in the video here), so you don't need another one (let alone any one directional fence at entry/exit of critical sections). Also, there's no such thing as locking the L3, **mfence does not lock anything** (so it does not ensure any atomicity), it just ensures serialization of all memory operations *in the thread that used it* – Leeor Sep 30 '13 at 17:50
  • @Leeor I know that MFENCE = LFENCE (getting data from the L1/L2 caches of other cpu-cores into our own core's L1/L2 caches for Invalid cache lines) + SFENCE (dissemination of our own core's Modified cache lines to the L1/L2 caches of the other cpu-cores). But does MFENCE not lock the bus while these Invalid/Modified cache lines are updated, namely blocking the L3 cache and RAM for the whole duration of the MFENCE execution? Because the sequence of such exchanges is very important for sequential consistency, i.e. while MFENCE is executing on one core, the other cores can't launch MFENCE. – Alex Sep 30 '13 at 18:15
  • @Leeor I'm not saying that `MFENCE` locks the bus for any other instructions, but that for itself it locks the RAM bus and L3 (LLC). Only `LOCK` locks the instruction for which it is a prefix, and it locks only the single cache line for that memory cell, holding the cache line in the "Modified" state during execution: `LOCK XCHG, LOCK XADD, LOCK CMPXCHG`. – Alex Sep 30 '13 at 18:19
  • 1
    @Alex, I think you got it mixed up a little - fences are creatures of the ISA, x86 in this case. Caches are implementation details, and are "under the hood" mostly. Any x86 load/store operation will collect coherent data from other cores/sockets thanks to a MESI/snoops protocol. Modified lines in your own core are also maintained by that protocol (although there is an ISA hook to flush them out - but that's with wbinvd/clflush, not sfence). Either way, the exact behavior of the HW may differ between products (but most modern CPUs don't have to go with expensive bus locks for these ops) – Leeor Sep 30 '13 at 18:37
  • @Leeor You are right for the old single-core processors, where the cache-coherence protocol was **MESI** and "snoops" were used for other devices, but for multi-core processors the concurrency problem is solved through the **MOESI/MESIF** protocols. By using the `LOCK` prefix, the Owned/Forward/Modified state is held (locked) for the current cache line for the duration of the atomic operation. Similarly (but for the L3 cache and RAM shared by all cpu-cores) **there should be a strict sequence of instructions**: `MFENCE` from CoreX, `MFENCE` from CoreY, `MFENCE` from CoreZ... How is that provided? – Alex Sep 30 '13 at 19:06
  • 3
    MESIF/MOESI allow some optimization in HW, but are not relevant here - a lock will hold any line in place regardless of state. However, I don't agree with your 2nd part - MFENCE applies only for the program order in a given thread, not others. It may help in some consistency cases (as I wrote here - http://stackoverflow.com/questions/19059542/how-do-fences-atomize-load-modify-store-operations/19060548#19060548), but only because it serializes each thread internally, not through any atomicity, or "cache locking" as you insinuate. If you think otherwise, please open a question with an example. – Leeor Sep 30 '13 at 19:28
  • @Leeor If you think that "MFENCE applies only to the program order in a given thread", and thread-1 and thread-2 perform an SFENCE at the same time (simultaneously), then how can this SFENCE ensure that thread-3 will see in its L1/L2 cache the data first from thread-1 (SFENCE) and then from thread-2 (SFENCE), and that thread-4 receives the data in the same sequence, if the two SFENCEs on the two threads (1 & 2) were performed simultaneously? – Alex Sep 30 '13 at 19:48
  • @Alex - well, barriers are one example. – Leeor Sep 30 '13 at 21:38
  • 2
    @Alex SFENCE is an ordered flush of local outstanding writes to shared memory. Two cores can do simultaneous SFENCE and a third core will see the writes interleaved. Intel says: "Writes from an individual processor are NOT ordered with respect to the writes from other processors." – smossen Oct 01 '13 at 18:06
  • 1
    In addition to being wrong, this doesn't answer the question. – BeeOnRope Oct 21 '18 at 00:05
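
To make the last comments concrete, here is a hedged IRIW-style sketch in C++11 (the thread structure and names are illustrative, not taken from the discussion): two threads store to independent variables while two readers observe both. With `memory_order_seq_cst` all four threads must agree on a single order of the two stores; with weaker orderings (e.g. acquire/release) the C++ memory model allows the two readers to disagree.

```cpp
// Sketch of an IRIW (independent reads of independent writes) test.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> a{0}, b{0};

int main() {
    int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
    std::thread t1([] { a.store(1, std::memory_order_seq_cst); });
    std::thread t2([] { b.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] { r1 = a.load(std::memory_order_seq_cst);
                         r2 = b.load(std::memory_order_seq_cst); });
    std::thread t4([&] { r3 = b.load(std::memory_order_seq_cst);
                         r4 = a.load(std::memory_order_seq_cst); });
    t1.join(); t2.join(); t3.join(); t4.join();
    // The outcome r1==1, r2==0, r3==1, r4==0 would mean the two readers saw
    // the stores in opposite orders; seq_cst forbids it, acquire/release in
    // the C++ memory model does not.
    std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
}
```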