
According to Appendix C of the famous "Is Parallel Programming Hard, And, If So, What Can You Do About It?", programmers need to manually add memory barriers to avoid memory ordering violations caused by store buffers and invalidate queues.

However, both the Intel software optimization manual and perfmon mention something called "MACHINE_CLEARS.MEMORY_ORDERING", where the pipeline is automatically flushed "when a memory read may not conform to the memory ordering rules of the x86 architecture".

Memory order machine clear happens when a snoop request occurs and the machine is uncertain if memory ordering will be preserved. For instance, consider two loads: one to address X followed by another to address Y in the program order. Both loads were issued; however, the load to Y completes first, and all the ops that depend on this load continue with the data it loaded. The load to X is still waiting for its data. Simultaneously, another processor writes to the same address Y and causes a snoop to address Y. This presents a problem. The load to Y received the old value, but X is not finished loading. The other processor saw the loads in a different order by not consuming the latest value from the store to address Y. Everything from the load to address Y onward must be undone so the post-write data may be seen. Note: Without other pending reads, load Y does not require undoing. The ordering problem is caused by the unfinished load to X.

So can I conclude that software memory barriers are no longer necessary on processors with this kind of self-detection and repair ability?

Peter Cordes
    If you are writing x86 assembly, then sure, but if you are writing in other languages such as C++, then the compiler can and will reorder statements as long as the result conforms to the language memory model. In that case, memory barriers are needed so that the compiler respects your intent. – Quân Anh Mai May 09 '23 at 03:09

1 Answer


The x86 memory model (program order + a store buffer with store-forwarding) allows StoreLoad reordering, so you still need barriers if you want sequential consistency. (Sequential consistency is usually not necessary, but it is the default for lock-free atomics in many high-level languages.)
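As a rough illustration of the StoreLoad case, here is a minimal sketch of the classic "store buffer" litmus test. The helper names store_buffer_trial and count_both_zero are hypothetical, made up for this example. With seq_cst on both operations, the C++ memory model forbids the outcome r1 == 0 && r2 == 0, and on x86 the compiler has to emit a full barrier for the store (typically xchg, or mov + mfence). With release stores and acquire loads, both zero is allowed, although thread start-up overhead makes it rare in a test this simple.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: one round of the store-buffer litmus test.
// Each thread stores to its own variable, then loads the other one.
std::pair<int, int> store_buffer_trial(std::memory_order store_mo,
                                       std::memory_order load_mo) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { x.store(1, store_mo); r1 = y.load(load_mo); });
    std::thread t2([&] { y.store(1, store_mo); r2 = x.load(load_mo); });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Hypothetical helper: count how many rounds saw r1 == 0 && r2 == 0,
// i.e. both loads ran ahead of the other thread's store becoming visible.
int count_both_zero(int trials, std::memory_order store_mo,
                    std::memory_order load_mo) {
    assert(trials >= 0);
    int hits = 0;
    for (int i = 0; i < trials; ++i) {
        auto r = store_buffer_trial(store_mo, load_mo);
        if (r.first == 0 && r.second == 0) ++hits;
    }
    return hits;
}
```

Calling count_both_zero(n, std::memory_order_seq_cst, std::memory_order_seq_cst) must return 0 for any n. Passing std::memory_order_release and std::memory_order_acquire instead may return a non-zero count on real x86 hardware, which is the StoreLoad reordering in action.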

But x86 doesn't need asm barrier instructions for acquire/release, so usually you just need to prevent compile-time reordering (e.g. with C++ std::atomic_thread_fence(std::memory_order_acquire), or x.store(1, std::memory_order_release)).
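A minimal sketch of the acquire/release case, assuming a simple producer/consumer pair (the function name message_passing_trial and the value 42 are made up for this example). On x86 the release store and the acquire load each compile to a plain mov, so the memory orders here only constrain what the compiler may reorder.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publish a plain int through a release/acquire flag
// and return what the consumer read.
int message_passing_trial() {
    int payload = 0;                  // ordinary non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release: the payload write cannot be reordered after this store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Acquire: the payload read cannot be reordered before this load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen != -1);  // the consumer always assigns seen before exiting
    return seen;
}
```

Because the acquire load that reads true synchronizes with the release store, the consumer is guaranteed to read 42. With memory_order_relaxed on both sides this would be a data race on payload in C++, even though the x86 hardware itself would not reorder the two stores or the two loads.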


x86 CPUs speculatively load early and out-of-order, and sometimes have to nuke the pipeline when they detect mis-speculation (since LoadLoad reordering isn't architecturally allowed). That's what MACHINE_CLEARS.MEMORY_ORDERING is about.

Peter Cordes