I think I understand some aspects of memory ordering guarantees after reading a few materials on the net. However, the rules seem a little magical when looked at only from the software and theoretical point of view. An example of why two processors can appear to reorder is explained here, and it helped me a lot to actually visualise the process. What I understood is that if the prefetcher loads a value early for one processor but not for the other, then to an outside observer it looks as if the first processor performed its read earlier than the second (and, in the absence of synchronisation, could now hold a stale value), so the instructions appear reordered.
After that I went looking for explanations, from the CPU's point of view, of how such effects can be produced. For instance, consider acquire-release ordering. A classic example usually quoted for this is something like:
thread-0: x.store(true,std::memory_order_release);
thread-1: y.store(true,std::memory_order_release);
thread-2:
while(!x.load(std::memory_order_acquire));
if(y.load(std::memory_order_acquire)) ++z;
thread-3:
while(!y.load(std::memory_order_acquire));
if(x.load(std::memory_order_acquire)) ++z;
Since there is no total order as there is with sequential consistency, thread-2 can see thread-0's store first followed by thread-1's, while thread-3 can see thread-1's store first followed by thread-0's. Thus z==0 is a possible outcome.
If there were an explanation (say, taking four CPUs, each running one of the threads above) of what would happen in hardware to make us see this reordering, it would be immensely helpful. It does not have to be a very complex, detailed real-world case (though it can be, if that's the only way to understand it). An approximation like the one in the linked answer above, with something about caches (or any other participating factor) thrown in, should do it for me, and probably for many others.
Another one is:
thread-0:
x.store(true,std::memory_order_relaxed);
y.store(true,std::memory_order_release);
thread-1:
while(!y.load(std::memory_order_acquire)); // <------ (1)
if(x.load(std::memory_order_relaxed)) ++z;
Following the rules again, I can understand that this will never give z==0 (assuming all initial values are 0), and why changing (1) to relaxed might give z==0. But once more, it appears somewhat magical until I can think of how it can physically happen.
Thus any help (or pointers) that uses an adequate number of processors, their caches, etc. for the explanation would be greatly appreciated.