I was reading the Intel 64 and IA-32 instruction set reference to get an idea of how memory fences work. My question: taking SFENCE as an example, in order to make sure that all store operations are globally visible, does a multi-core CPU park all threads, even those running on other cores, until cache coherence is achieved?

- @Stephen C - why don't you make this comment an answer? – theMayer Aug 12 '18 at 13:28
1 Answer
Barriers don't make other threads/cores wait. They make some operations in the current thread wait, depending on what kind of barrier it is. Out-of-order execution of non-memory instructions isn't necessarily blocked.
Barriers don't even make your loads/stores visible to other threads any faster; CPU cores already commit (retired) stores from the store buffer to L1d cache as fast as possible. (After all the necessary MESI coherency rules have been followed, and x86's strong memory model only allows stores to commit in program order even without barriers).
Barriers don't necessarily order instruction execution, they order global visibility, i.e. what comes out the far end of the store buffer.
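As a concrete illustration of that first point, here's a minimal C++ message-passing sketch (the names `payload` and `ready` are hypothetical, not from this answer): each fence only constrains the ordering of accesses in the thread that executes it, and neither thread is ever parked by the other's fence.

```cpp
// Minimal message-passing sketch; `payload` and `ready` are hypothetical
// names. Each fence orders only the accesses of the thread that executes it;
// neither thread is stalled by the other thread's fence.
#include <atomic>

int payload;                      // plain data, published via the flag
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                         // (1) write the data
    std::atomic_thread_fence(std::memory_order_release);  // order (1) before (2)
    ready.store(true, std::memory_order_relaxed);         // (2) publish the flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}     // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // order flag load before (3)
    int v = payload;                                      // (3) guaranteed to see 42
    (void)v;
}
```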
`mfence` (or a `lock`ed operation like `lock add` or `xchg [mem], reg`) makes all later loads/stores in the current thread wait until all previous loads and stores are completed and globally visible (i.e. the store buffer is flushed).
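For example, here's a Dekker-style sketch of the StoreLoad case a full barrier exists for (the `me`/`other` flags are hypothetical names; a second thread would run mirror-image code with the roles swapped):

```cpp
// Dekker-style sketch of StoreLoad reordering; `me` and `other` are
// hypothetical flags. Without the mfence, the store to `me` may still be
// sitting in the store buffer when `other` is loaded, so two threads running
// mirror-image code can BOTH read 0 and enter the critical section.
#include <atomic>
#include <emmintrin.h>  // _mm_mfence

std::atomic<int> me{0}, other{0};  // the second thread swaps the roles

bool try_enter() {
    me.store(1, std::memory_order_relaxed);  // plain mov store
    _mm_mfence();                            // drain store buffer before the load
    return other.load(std::memory_order_relaxed) == 0;
}
```

In practice you'd usually get the same full barrier more cheaply from a seq_cst store, which compilers implement with `xchg` (a `lock`ed operation) rather than a separate `mfence`.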
`mfence` on Skylake is implemented in a way that stalls the whole core until the store buffer drains. See my answer on Are loads and stores the only instructions that gets reordered? for details; this extra slowdown was to fix an erratum. But `lock`ed operations and `xchg` aren't like that on Skylake; they're full memory barriers but they still allow out-of-order execution of `imul eax, edx`, so we have proof that they don't stall the whole core.
With hyperthreading, I think this stalling happens per logical thread, not the whole core.
But note that the `mfence` manual entry doesn't say anything about stalling the core, so future x86 implementations are free to make it more efficient (like `lock or dword [rsp], 0`) and only prevent later loads from reading L1d cache without blocking later non-load instructions.
`sfence` only does anything if there are any NT stores in flight. It doesn't order loads at all, so it doesn't have to stop later instructions from executing. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.

It just places a barrier in the store buffer that stops NT stores from reordering across it, and forces earlier NT stores to be globally visible before the `sfence` barrier can leave the store buffer (i.e. write-combining buffers have to flush). But `sfence` itself can already have retired from the out-of-order execution part of the core (the ROB, or ReOrder Buffer) before it reaches the end of the store buffer.
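The canonical pattern where `sfence` matters looks something like this sketch (the `dst` buffer and `done` flag are hypothetical names): NT stores are weakly ordered even on x86, so they have to be flushed before publishing a flag with an ordinary store.

```cpp
// Sketch of the one case where sfence is needed; `dst` and `done` are
// hypothetical names. NT stores go through write-combining buffers and are
// NOT covered by x86's normal store-store ordering, so flush them before
// publishing with an ordinary store.
#include <atomic>
#include <emmintrin.h>  // _mm_stream_si128, _mm_set1_epi32, _mm_sfence
#include <cstdint>

alignas(16) int32_t dst[4];
std::atomic<bool> done{false};

void nt_publish() {
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst),
                     _mm_set1_epi32(123));        // weakly-ordered NT store
    _mm_sfence();                                 // NT store globally visible first...
    done.store(true, std::memory_order_release);  // ...then the ordinary flag store
}
```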
See also Does a memory barrier ensure that the cache coherence has been completed?
`lfence` as a memory barrier is nearly useless: it only prevents `movntdqa` loads from WC memory from reordering with later loads/stores. You almost never need that.
The actual use-cases for `lfence` mostly involve its Intel (but not AMD) behaviour that it doesn't allow later instructions to execute until it itself has retired. (So `lfence; rdtsc` on Intel CPUs lets you avoid having `rdtsc` read the clock too soon, as a cheaper alternative to `cpuid; rdtsc`.)
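As a sketch of that timing idiom, assuming GCC or Clang targeting x86 (the function name is hypothetical):

```cpp
// Timing sketch for GCC/Clang on x86; the function name is hypothetical.
// The lfence keeps __rdtsc from executing before earlier work has finished,
// relying on the Intel-documented "later instructions wait for lfence to
// retire" behaviour described above.
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang umbrella header)
#include <cstdint>

uint64_t tsc_after_prior_work() {
    _mm_lfence();      // earlier instructions must complete before this retires
    return __rdtsc();  // so the TSC read can't happen too soon
}
```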
Another important recent use-case for `lfence` is to block speculative execution (e.g. before a conditional or indirect branch), for Spectre mitigation. This is completely based on its Intel-guaranteed side effect of being partially serializing, and has nothing to do with its LoadLoad + LoadStore barrier effect.
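A Spectre-v1-style sketch of that usage (the `table`, `len`, and function names are hypothetical, not from any particular mitigation library):

```cpp
// Spectre-v1 mitigation sketch; `table`, `len`, and the function name are
// hypothetical. lfence doesn't execute until the earlier bounds-check branch
// has resolved, and later instructions don't execute until lfence retires,
// so a mispredicted out-of-bounds `i` can't feed a speculative load.
#include <emmintrin.h>  // _mm_lfence
#include <cstddef>
#include <cstdint>

uint8_t table[256];

uint8_t bounds_checked_read(size_t i, size_t len) {
    if (i < len) {
        _mm_lfence();  // speculation barrier after the bounds check
        return table[i];
    }
    return 0;
}
```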
`lfence` does not have to wait for the store buffer to drain before it can retire from the ROB, so no combination of LFENCE + SFENCE is as strong as MFENCE. See Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?.
Related: When should I use _mm_sfence _mm_lfence and _mm_mfence (when writing in C++ instead of asm).
Note that the C++ intrinsics like `_mm_sfence` also block compile-time memory ordering. This is often necessary even when the asm instruction itself isn't, because C++ compile-time reordering happens based on C++'s very weak memory model, not the strong x86 memory model which applies to the compiler-generated asm.
So `_mm_sfence` may make your code work, but unless you're using NT stores it's overkill. A more efficient option would be `std::atomic_thread_fence(std::memory_order_release)`, which turns into zero instructions, just a compiler barrier. See http://preshing.com/20120625/memory-ordering-at-compile-time/.
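To make the contrast concrete, here's a sketch with hypothetical `data`/`flag` names: for ordinary (non-NT) stores, the release fence compiles to zero x86 instructions because the hardware already commits stores in program order, so only the compiler needs to be restrained.

```cpp
// Contrast sketch; `data` and `flag` are hypothetical names. With ordinary
// (non-NT) stores, the release fence costs zero x86 instructions: it only
// stops the compiler from reordering, since x86 hardware already commits
// stores in program order.
#include <atomic>

int data;
std::atomic<bool> flag{false};

void publish() {
    data = 1;
    std::atomic_thread_fence(std::memory_order_release);  // compiler barrier only on x86
    flag.store(true, std::memory_order_relaxed);          // plain mov store
}
```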

- RE "lfence as a memory barrier is nearly useless": lfence is now the mainstream way of dealing with most Spectre-like vulnerabilities in software. Anyway, the question seems to me too broad because a detailed discussion of each fence is a lot to write. But this answer should resolve the main misunderstanding of the OP, I think. – Hadi Brais Aug 12 '18 at 22:08
- @HadiBrais: Exactly. That use case has nothing to do with ordering between two data accesses to block LoadLoad or LoadStore reordering. It's for the Intel-guaranteed side-effect of blocking OoO exec. – Peter Cordes Aug 12 '18 at 22:11
- Re "CPU cores already commit (retired) stores from the store buffer to L1d cache as fast as possible": I'm interested in the part "as fast as possible". Is that accurate? I think a store can intentionally hang out for a while in the store buffer to benefit from store-load forwarding. – Hadi Brais Aug 12 '18 at 22:15
- From 11.10 of V3: "It [the store buffer] also allows writes to be delayed for more efficient use of memory-access bus cycles." I don't know what this means exactly, but I think you see my point. Not only store-load forwarding but also store coalescing and combining. – Hadi Brais Aug 12 '18 at 22:19
- @HadiBrais: That sounds like a description of why the store buffer *exists* in the first place, to decouple in-order commit from the execution pipeline, and from loads. I haven't heard of intentionally delaying commit. Would that help for a store/reload that's split across a cache-line boundary? L1d load/use latency is about the same as store-forward latency, and SF latency doesn't include address-generation latency. Maybe if a store-forwarding was already detected and lined up? If it's possible for that to happen in the same cycle that the data could have otherwise committed? – Peter Cordes Aug 12 '18 at 22:24
- Yes, I can imagine that might be useful, depending on other instructions. Even if the latency is about the same, there are only two load ports in the L1D. Another question is whether it is useful to commit the store as fast as possible. This is important when another core needs to see the data, but a synchronization mechanism needs to be used anyway. It is also important for persistent memory, but a synchronization mechanism must be used there too. Otherwise, it does not seem useful to make it visible "as fast as possible", and it seems useful to delay it in some cases. – Hadi Brais Aug 12 '18 at 22:35
- @HadiBrais: what does persistent memory have to do with anything? I'm talking about commit from the store buffer to L1d (globally visible because of MESI), not write-back to NVDIMM. Commit as fast as possible is done to free up store-buffer entries for later stores, just like retirement from the ROB is done as fast as possible. Re: L1d load ports: there are only 2 load-data execution units. Split loads (and I guess page walks) can also take cycles, but those are hopefully rare, and expected to hurt throughput. (Or does the page walker load from L2? I think so in P6, but I forget.) – Peter Cordes Aug 12 '18 at 22:49
- @HadiBrais: If you want HW to be able to decide not to commit a store to L1d in a cycle when it could have done so, you need extra logic to decide how long to wait, and so on. You need to avoid delaying a store indefinitely, even if the core doesn't do any more stores for a long time, because it would be weird for a core to keep a store private for an extended period of time. – Peter Cordes Aug 12 '18 at 22:52
- Persistent memory is a scenario where flushing the store buffer (and the cache) is required. I was arguing that flushing the store buffer as fast as possible is not useful, and trying to think of a scenario where it can be. – Hadi Brais Aug 12 '18 at 23:56
- @HadiBrais: I think the obvious reason is to prevent future stalls from the store buffer being full, defeating the decoupling of OoO exec from store commit. It's only safe to delay commit if you can see the future and see there won't be any cache-miss stores that prevent you from doing later commits at 1 per clock. (Remember x86's strong memory model requires in-order commit.) Any possible downside from commit-as-fast-as-possible is pretty small, so it doesn't seem worth it to build extra logic to consider delaying it. – Peter Cordes Aug 13 '18 at 00:14
- The situation for AMD and `lfence` is a bit more complicated than "AMD doesn't treat it as serializing". Some AMD archs (apparently) do treat it as serializing, and since Spectre, some allow you to set an MSR bit to make `lfence` serializing. See [here](https://lore.kernel.org/patchwork/patch/870217/) and [here](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf) for some details. – BeeOnRope Aug 14 '18 at 15:18
- This AMD/`lfence` thing comes up enough that maybe it deserves a [canonical question](https://stackoverflow.com/q/51844886/149138) (and hopefully one day a canonical answer). – BeeOnRope Aug 14 '18 at 15:29