
I have already seen this answer and this answer, but neither appears to be clear and explicit about the equivalence or non-equivalence of mfence and xchg under the assumption of no non-temporal instructions.

The Intel instruction reference for xchg mentions that this instruction is useful for implementing semaphores or similar data structures for process synchronization, and further references Chapter 8 of Volume 3A. That reference states the following.

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

The mfence documentation claims the following.

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

If we ignore weakly ordered memory types, does xchg (which implies lock) encompass all of mfence's guarantees with respect to memory ordering?

Hadi Brais
merlin2011

1 Answer


Assuming you're not writing a device driver (so all the memory is Write-Back, not weakly-ordered Write-Combining), then yes, xchg is as strong as mfence.

NT stores are fine.

I'm sure that this is the case on current hardware, and fairly sure that this is guaranteed by the wording in the manuals for all future x86 CPUs. xchg is a very strong full memory barrier.
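To make "full memory barrier" concrete, here is a minimal sketch of the classic store-buffering litmus test in C++. It is an illustration, not something from the manuals: the function names `both_saw_zero` and `count_violations` are made up for this example. With `seq_cst` operations (which compilers typically implement on x86 with `xchg` for the store side), the C++ memory model forbids both threads reading 0. With plain stores and no barrier, x86 would allow that outcome because of StoreLoad reordering.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of the store-buffering litmus test.
// Returns true if both threads read 0, which seq_cst forbids.
// (Hypothetical helper name, for illustration only.)
bool both_saw_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        // Store acting as a full barrier; typically xchg on x86.
        x.exchange(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
        y.exchange(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });

    t1.join();  // join() makes r1/r2 safely visible to this thread
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the litmus test and counts forbidden outcomes.
int count_violations(int rounds) {
    assert(rounds >= 0);
    int violations = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_saw_zero()) {
            ++violations;
        }
    }
    return violations;
}
```

Calling `count_violations(2000)` should return 0. A passing run does not prove the ordering guarantee, of course; it only fails to find a counterexample.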

Hmm, I haven't looked at prefetch instruction reordering. That might possibly be relevant for performance, or possibly even correctness in weird device-driver situations (where you're using cacheable memory when you probably shouldn't be).


From your quote:

(P4/Xeon) Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

That's the one thing that makes xchg [mem] weaker than mfence (on Pentium4? Probably also on Sandybridge-family).

mfence does guarantee that, which is why Skylake had to strengthen it to fix an erratum. (See "Are loads and stores the only instructions that gets reordered?", and also the answer you linked on "Does lock xchg have the same behavior as mfence?")

NT stores are serialized by xchg / lock, it's only weakly-ordered loads that may not be serialized. You can't do weakly-ordered loads from WB memory. movntdqa xmm, [mem] on WB memory is still strongly-ordered (and on current implementations, also ignores the NT hint instead of doing anything to reduce cache pollution).


It looks like xchg performs better for seq-cst stores than mov+mfence on current CPUs, so you should use that in normal code. (You can't accidentally map WC memory; normal OSes will always give you WB memory for normal allocations. WC is only used for video RAM or other device memory.)
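As a rough sketch of what "xchg vs. mov+mfence for a seq-cst store" looks like from C++, the three functions below all publish a value with sequential-consistency ordering. The variable `flag` and the function names are hypothetical. The comments about generated instructions describe what x86-64 compilers commonly emit, not a guarantee; check your own compiler's output.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared variable, used only for illustration.
std::atomic<int> flag{0};

// Option A: a plain seq-cst store. The compiler picks the instruction
// sequence; on x86 this is usually either xchg or mov + mfence.
void store_seq_cst(int v) {
    flag.store(v, std::memory_order_seq_cst);
}

// Option B: an exchange whose result is discarded. On x86 this maps to
// xchg with a memory operand, which is implicitly locked.
void store_via_exchange(int v) {
    (void)flag.exchange(v, std::memory_order_seq_cst);
}

// Option C: a relaxed store followed by a full fence, the source-level
// analogue of mov + mfence (the compiler may pick another full barrier).
void store_then_fence(int v) {
    flag.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Small single-threaded sanity check that each variant stores the value.
void demo_stores() {
    store_seq_cst(1);
    assert(flag.load() == 1);
    store_via_exchange(2);
    assert(flag.load() == 2);
    store_then_fence(3);
    assert(flag.load() == 3);
}
```

In normal code, option A is the one to write; options B and C are mainly useful for comparing the generated assembly or for benchmarking the two instruction sequences against each other.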


These guarantees are specified in terms of specific families of Intel microarchitectures. It would be nice if there were some common "baseline x86" guarantees that we could assume for future Intel and AMD CPUs.

I assume but haven't checked that the xchg vs. mfence situation is the same on AMD. I'm sure there's no correctness problem with using xchg as a seq-cst store, because that's what compilers other than gcc actually do.

Peter Cordes
  • Regarding AMD's behavior, see my answer on [Do locked instructions provide a barrier between weakly-ordered accesses?](https://stackoverflow.com/questions/50280857/do-locked-instructions-provide-a-barrier-between-weakly-ordered-accesses), which I think is inconsistent with Bee's answer you linked. I hope that the OP is aware of the many subtle differences between Intel and AMD, if they care about AMD processors. But from the question, it doesn't seem so. – Hadi Brais Aug 22 '18 at 23:33
  • Now I'm wondering: do any of the 32-bit x86 AMD processors support SSE2 (of which mfence is part)? It does not seem so. I said in my answer "and I think 32-bit x86 AMD processors", but that wouldn't apply to these processors because they don't support mfence. – Hadi Brais Aug 22 '18 at 23:40
  • @HadiBrais: Unless Geode or something supports `mfence` for backwards compat (but not SSE1 or the rest of SSE2), then probably no. K8 introduced SSE2 and AMD64 in the same microarchitecture. It's extremely unlikely that `mfence` is somehow weaker on AMD outside of long/compat mode on CPUs that are 64-bit capable, and also unlikely that there's an undocumented weaker `mfence` that AMD neglected to mention on a few 32-bit-only CPUs. If they talk about AMD64 `mfence`, that's probably because it was new with AMD64 for them. – Peter Cordes Aug 22 '18 at 23:45
  • mfence is listed on page 639 of the [AMD Geode data book](https://support.amd.com/TechDocs/33234H_LX_databook.pdf). Also lfence and sfence are there. But they are not described there. Interesting. – Hadi Brais Aug 22 '18 at 23:50
  • My impression is that `xchg` was supposed to be as strong as `mfence`, even for WC memory loads, but that it didn't pan out due to an oversight, because the WC-load behavior for `lock`-prefixed instructions (including the implicitly locked `xchg`) was mentioned in an erratum and only appeared long after the processors' release. Probably they couldn't fix that without a big performance impact on `lock`ed instructions, so `mfence` took the hit. At least that's what I thought - but the first quote in the OP, if old, implies that maybe this was known a while ago... – BeeOnRope Aug 23 '18 at 02:34