According to Appendix C of the well-known book "Is Parallel Programming Hard, And, If So, What Can You Do About It?", programmers need to insert memory barriers manually to avoid memory-ordering violations caused by store buffers and invalidate queues.
However, both the Intel software optimization manual and the perfmon event list mention a performance event called "MACHINE_CLEARS.MEMORY_ORDERING", which counts cases where the processor automatically flushes the pipeline "when a memory read may not conform to the memory ordering rules of the x86 architecture". The documentation describes it as follows:
A memory-order machine clear happens when a snoop request occurs and the machine is uncertain whether memory ordering will be preserved. For instance, consider two loads: one to address X followed by another to address Y in program order. Both loads were issued; however, the load to Y completes first, and all the ops that depend on this load continue with the data it loaded. The load to X is still waiting for its data. Simultaneously, another processor writes to the same address Y and causes a snoop to address Y. This presents a problem. The load to Y received the old value, but X has not finished loading. The other processor would see the loads in a different order, because the load to Y did not consume the latest value from the store to address Y. Everything from the load to address Y onward must be undone so that the post-write data may be seen. Note: without other pending reads, the load to Y does not need to be undone. The ordering problem is caused by the unfinished load to X.
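To make the two-load scenario concrete, here is a minimal C++ sketch of it as a litmus test. The names `X`, `Y`, and `run_litmus` are hypothetical and chosen only for illustration; they do not come from the Intel documentation. One thread writes Y and then X, while the other loads X and then Y in program order. The outcome "new X, old Y" is the one the ordering rules forbid. The acquire/release orderings are my assumption for keeping the compiler from reordering the accesses; on x86 they are expected to compile to plain loads and stores.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared locations matching the X and Y of the description.
std::atomic<int> X{0}, Y{0};

// Runs the two-load scenario `iterations` times and returns how often the
// reader observed the forbidden outcome (new value of X, old value of Y).
int run_litmus(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        // Reset both locations before the threads of this round start.
        X.store(0, std::memory_order_relaxed);
        Y.store(0, std::memory_order_relaxed);
        int r1 = 0, r2 = 0;

        std::thread writer([] {
            Y.store(1, std::memory_order_relaxed);  // write Y first
            X.store(1, std::memory_order_release);  // then write X
        });
        std::thread reader([&] {
            r1 = X.load(std::memory_order_acquire);  // load X first in program order
            r2 = Y.load(std::memory_order_relaxed);  // load Y second
        });
        writer.join();
        reader.join();

        // Seeing the new X together with the old Y would mean the two
        // loads appeared to execute out of program order.
        if (r1 == 1 && r2 == 0) ++forbidden;
    }
    return forbidden;
}
```

Calling `run_litmus(100000)` from `main` and printing the result is enough to try it (build with thread support, for example `-pthread` on GCC or Clang). The sketch covers only the load-load case described above, not other kinds of reordering.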
So can I conclude that software memory barriers are no longer necessary on processors with this kind of self-detection and repair capability?