Memory ordering from hardware perspective

Question

I think I understand memory ordering guarantees to some extent after reading a few materials on the net. However, the rules seem a little magical when looked at only from a software and theoretical point of view. An example of why two processors could appear to reorder operations is explained here, and it helped me a lot to actually visualise the process. What I understood is that if the prefetcher loads a read early for one processor but not for the other, then to an outside observer it looks as if the first processor did its read earlier than the second (and could now hold a stale value in the absence of synchronisation), and thus the instructions appear reordered.
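The effect described above is commonly demonstrated with a two-thread store-then-load litmus test. The following is a minimal sketch under stated assumptions: the names `Observed` and `run_store_load_once` are hypothetical, and the code is not taken from the linked explanation. Each thread stores to one variable and then loads the other; with relaxed ordering the C++ memory model allows both loads to return 0, which looks to an observer as if each load ran before the other thread's store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Values read by the two threads in one run of the test (hypothetical type).
struct Observed {
    int r0;  // what thread 0 read from y
    int r1;  // what thread 1 read from x
};

// Each thread stores to one variable, then loads the other one.
// With relaxed ordering, r0 == 0 && r1 == 0 is an allowed outcome.
Observed run_store_load_once() {
    std::atomic<int> x{0}, y{0};
    int r0 = -1, r1 = -1;

    std::thread t0([&] {
        x.store(1, std::memory_order_relaxed);
        r0 = y.load(std::memory_order_relaxed);
    });
    std::thread t1([&] {
        y.store(1, std::memory_order_relaxed);
        r1 = x.load(std::memory_order_relaxed);
    });
    t0.join();
    t1.join();

    // Every load returns either the initial value or the stored value.
    assert((r0 == 0 || r0 == 1) && (r1 == 0 || r1 == 1));
    return {r0, r1};
}
```

Calling `run_store_load_once()` in a loop and counting the runs where both fields are 0 is one way to look for the effect. Whether it shows up in practice depends on the hardware and on timing, since thread start-up cost usually dwarfs the window in which it can occur.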

After that I went looking for explanations, from the CPU's point of view, of how such effects can be produced. For instance, consider acquire-release ordering. A classic example usually quoted for it is something like:

thread-0: x.store(true,std::memory_order_release);
thread-1: y.store(true,std::memory_order_release);

thread-2:
while(!x.load(std::memory_order_acquire));
if(y.load(std::memory_order_acquire)) ++z;

thread-3:
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire)) ++z;

Since there is no total order as there is under sequential consistency, thread-2 can see thread-0's store first and thread-1's store second, while thread-3 can see thread-1's store first and thread-0's store second. Thus z==0 is a possible outcome.
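For anyone who wants to try this on real hardware, here is a minimal runnable sketch of the four-thread example. It assumes `x` and `y` are `std::atomic<bool>` and `z` is a `std::atomic<int>`, all starting at false/0. The function names `run_iriw_once` and `count_zero_outcomes` are hypothetical helpers introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the four-thread test once and returns the final value of z.
int run_iriw_once() {
    std::atomic<bool> x{false}, y{false};
    std::atomic<int> z{0};

    std::thread t0([&] { x.store(true, std::memory_order_release); });
    std::thread t1([&] { y.store(true, std::memory_order_release); });
    std::thread t2([&] {
        while (!x.load(std::memory_order_acquire)) {}  // wait until x is seen
        if (y.load(std::memory_order_acquire)) ++z;
    });
    std::thread t3([&] {
        while (!y.load(std::memory_order_acquire)) {}  // wait until y is seen
        if (x.load(std::memory_order_acquire)) ++z;
    });

    t0.join();
    t1.join();
    t2.join();
    t3.join();

    int result = z.load();
    assert(result >= 0 && result <= 2);  // only 0, 1, or 2 increments are possible
    return result;
}

// Repeats the test and counts how many runs ended with z == 0.
int count_zero_outcomes(int iterations) {
    int zeros = 0;
    for (int i = 0; i < iterations; ++i) {
        if (run_iriw_once() == 0) ++zeros;
    }
    return zeros;
}
```

The memory model permits z==0 here, but a given machine may never produce it. On x86, for example, all cores agree on the order of stores, so the count is expected to stay at zero there, and even on weaker hardware the outcome is rare when the threads are started this way.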

It would be immensely helpful to have an explanation (say, taking four CPUs, each running one of the threads above) of what happens in hardware to make us see this reordering. It does not have to be a very complex, detailed real-world case (though it can be, if that is the only way to understand it). An approximation like the one in the linked answer above, with something about caches (or any other participating factor) thrown in, should do it for me, and probably for many others, I guess.

Another one is:

thread-0:
x.store(true,std::memory_order_relaxed);
y.store(true,std::memory_order_release);

thread-1:
while(!y.load(std::memory_order_acquire)); // <------ (1)
if(x.load(std::memory_order_relaxed)) ++z;

Following the rules again, I can understand why this will never give z==0 (assuming all initial values are 0), and why changing (1) to relaxed might give z==0. But once more, it appears sort of magical until I can picture how it can physically happen.
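A minimal runnable sketch of this second example, assuming `x` and `y` are `std::atomic<bool>` initialised to false and `z` is a plain `int` written only by thread-1. The function name `run_message_passing_once` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-thread example once and returns the final value of z.
int run_message_passing_once() {
    std::atomic<bool> x{false}, y{false};
    int z = 0;  // written only by t1, read only after both threads are joined

    std::thread t0([&] {
        x.store(true, std::memory_order_relaxed);
        y.store(true, std::memory_order_release);  // publishes the store to x
    });
    std::thread t1([&] {
        while (!y.load(std::memory_order_acquire)) {}  // (1) pairs with the release store
        if (x.load(std::memory_order_relaxed)) ++z;
    });

    t0.join();
    t1.join();
    return z;
}
```

With the acquire load at (1), the release store to `y` synchronises with it, so the earlier store to `x` is guaranteed to be visible and the function returns 1. If (1) were changed to `std::memory_order_relaxed`, that guarantee would go away and 0 would become a permitted result, although many machines would still return 1 on every run.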

Thus any help (or pointers) that uses an adequate number of processors, their caches, etc. for the explanation would be immensely helpful.

The link refers to X86 StoreLoad reordering, which is not a type of reordering done by the CPU, but merely the result of a cache effect. I.e. the CPU writing to a store buffer before it enters the cache (and is propagated by cache-coherency). — LWimsey, Mar 02 '17 at 22:01
Right, but the net result is the same, is it not? If proper sync primitives were involved, the earlier prefetched read would be refreshed (i.e. prevented from being stale) in case of a write. In their absence, we can see how, physically, memory_order_relaxed for instance will make the instructions appear to execute out of order to an external observer. As you say, it may not be just this; it could be actual out-of-order execution or a mix of both. So that's what I am looking for here: an explanation of the physical phenomena that lead to the examples producing the side effects shown. — ustulation, Mar 03 '17 at 13:15
Yes, the effect is the same; it is as if the store and the load had been reordered. A typical way to prevent it in a TSO model (X86) is to insert a full MFENCE right after the store. That is relatively infrequent, though, since many software designs do not rely on it. It's just an example of how reordering plays out on X86. An answer to this question should describe more causes. — LWimsey, Mar 03 '17 at 20:08
There's a good explanation of how IRIW reordering can happen, as in your first example, at https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different-locations-in-different-threads-always-be-see/50679223#50679223 . The example is POWER, and the short explanation is that as a consequence of the CPU topology, certain threads may get special early access to stores made by certain other threads, and be able to read them before they hit the L1 cache and become globally visible. — Nate Eldredge, Feb 05 '22 at 05:24
