Is software memory barrier unnecessary on processors which can automatically trigger machine clear on memory ordering violation?

Question

According to Appendix C of the famous "Is Parallel Programming Hard, And, If So, What Can You Do About It?", programmers need to add memory barriers manually to avoid memory ordering violations caused by the store buffer and the invalidate queue.

However, both the Intel software optimization manual and perfmon mention a feature called "MACHINE_CLEARS.MEMORY_ORDERING", which can automatically flush the pipeline "when a memory read may not conform to the memory ordering rules of the x86 architecture".

Memory order machine clear happens when a snoop request occurs and the machine is uncertain if memory ordering will be preserved. For instance, consider two loads: one to address X followed by another to address Y in the program order. Both loads were issued; however, the load to Y completes first, and all the ops that depend on this load continue with the data it loaded. The load to X is still waiting for its data. Simultaneously, another processor writes to the same address Y and causes a snoop to address Y. This presents a problem. The load to Y received the old value, but X is not finished loading. The other processor saw the loads in a different order by not consuming the latest value from the store to address Y. Everything from the load to address Y onward must be undone so the post-write data may be seen. Note: Without other pending reads, load Y does not require undoing. The ordering problem is caused by the unfinished load to X.

So can I conclude that software memory barriers are no longer necessary on processors with this kind of self-detection and repair ability?

If you are writing x86 assembly, then sure, but if you are writing in another language such as C++, the compiler can and will reorder statements as long as the result conforms to the language memory model. In that case, memory barriers are needed so that the compiler respects your intended ordering. - Quân Anh Mai, May 09 '23 at 03:09

score 1 · Answer 1 · answered May 08 '23 at 16:32

The x86 memory model (program order + a store buffer with store-forwarding) allows StoreLoad reordering, so you still need barriers if you want sequential consistency. (Sequential consistency is usually not necessary, but it is the default ordering for lock-free atomics in many high-level languages.)

Why does a std::atomic store with sequential consistency use XCHG?
Can a speculatively executed CPU branch contain opcodes that access RAM? (why the store buffer exists, and why it creates StoreLoad reordering)
How do modern Intel x86 CPUs implement the total order over stores
Globally Invisible load instructions - loads that partially overlap with recent stores can sometimes see values that no other core can ever see, thanks to slow-path store-forwarding.
How does memory reordering help processors and compilers? - Why allowing StoreLoad reordering is important for CPU performance. The others can all be worked around (e.g. with speculative early loads that sometimes result in pipeline nukes on mis-speculation), but storing late is essential.
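As a rough illustration of the StoreLoad case, here is a minimal sketch of the classic store-buffer litmus test in C++. The function name `count_both_zero` and the `SeqCst` template flag are hypothetical names made up for this example. With `seq_cst` operations the outcome `r1 == 0 && r2 == 0` is forbidden by the C++ memory model; with release stores and acquire loads (which are plain `mov` on x86) it is allowed, because each store can still be sitting in the store buffer when the following load executes.

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test: each thread stores to one variable and then
// loads the other one. Returns how many trials observed r1 == 0 && r2 == 0.
// SeqCst == true  -> seq_cst stores and loads (that outcome is forbidden).
// SeqCst == false -> release stores, acquire loads (that outcome is allowed).
template <bool SeqCst>
int count_both_zero(int trials) {
    constexpr std::memory_order store_order =
        SeqCst ? std::memory_order_seq_cst : std::memory_order_release;
    constexpr std::memory_order load_order =
        SeqCst ? std::memory_order_seq_cst : std::memory_order_acquire;

    int both_zero = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, store_order);   // store, then ...
            r1 = y.load(load_order);   // ... load of a different location
        });
        std::thread b([&] {
            y.store(1, store_order);
            r2 = x.load(load_order);
        });
        a.join();
        b.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling `count_both_zero<true>(n)` must return 0 for any `n`. `count_both_zero<false>(n)` may return a nonzero count on x86, although spawning fresh threads per trial makes the window small, so a reliable reproduction would need the two threads to be tightly synchronized before each trial.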

But x86 doesn't need asm barrier instructions for acquire/release, so usually you just need to prevent compile-time reordering. (e.g. with C++ std::atomic_thread_fence(std::memory_order_acquire), or x.store(1, std::memory_order_release)).

how are barriers/fences and acquire, release semantics implemented microarchitecturally?
C++ How is release-and-acquire achieved on x86 only using MOV?
When are x86 LFENCE, SFENCE and MFENCE instructions required? (very rarely; lfence in particular is nearly useless for memory ordering).
When should I use _mm_sfence _mm_lfence and _mm_mfence (The C intrinsics also give compile-time ordering, but there are cheaper ways to get that ordering without the unnecessary asm instruction.)
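To make the acquire/release point concrete, here is a minimal message-passing sketch in C++. The names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are hypothetical, chosen only for this example. On x86 the release store and the acquire load typically compile to plain `mov` instructions with no fence; the memory orders mainly stop the compiler from moving the `payload` accesses across the flag operations.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data being published
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;                                  // write the data first
    ready.store(true, std::memory_order_release);  // then publish it
}

int consumer() {
    // Spin until the flag is set. The acquire load pairs with the release
    // store above, so the write to payload is visible once we see true.
    while (!ready.load(std::memory_order_acquire)) {
    }
    return payload;
}

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);

    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

`run_message_passing()` returns 42 under the C++ memory model on any target. If `ready` were a plain `bool` instead, the program would have a data race, and the compiler would be free to hoist the load out of the loop or reorder the two writes, even though the x86 hardware itself would not reorder them.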

x86 CPUs speculatively load early and out-of-order, and sometimes have to nuke the pipeline when they detect mis-speculation (since LoadLoad reordering isn't architecturally allowed). That's what MACHINE_CLEARS.MEMORY_ORDERING is about.

Why flush the pipeline for Memory Order Violation caused by other logical processors?
Does an x86 CPU reorder instructions?
What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - interesting example where two logical cores on the same physical core are slower, causing way more memory ordering machine clears than separate physical cores.
