As I see it, this question is not really about reordering. As you said yourself, a cardinal rule of concurrency is that every thread observes its own loads and stores as if in program order. Thread 1 absolutely sees the load of x happen after the store of 0 to x, and so the load will yield the value 0 (assuming no other thread is storing to x concurrently). Thus the value stored to y will be 0, no ifs, ands, or buts, and no fences needed. The second thread cannot ever print "wtf?".
Reordering would be the question of whether the load of x could become visible to another thread before the store of 0. This is entirely possible. You could observe it by having Thread 2 do:
if (y == 0) {
    x = 17;
    System.out.println(x);
}
You might think that if we load 0 from y, then the store to y already happened, which means the load from x already happened, which means the store of 0 to x already happened. And so you would think that Thread 2 must print 17, because all other stores to x happened earlier. But in fact it could print 0.
Note that to get this behavior, you must not synchronize Thread 2 to wait until after the second fence. If you do, then it can only print 17, since after the fence, all of Thread 1's loads and stores have finished and become visible. But if you don't, it's a data race, and that allows pretty free reordering, essentially C++'s memory_order_relaxed if I am not mistaken.
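For anyone who wants to poke at this, here is a rough litmus-test sketch. The body of Thread 1 is my reconstruction of the code in the question, and the class name, the tally method, and the initial value of -1 for y are all made up for illustration. Do not expect it to show the 0 outcome on any particular machine or JVM: the JIT may well fold the final load of x into the constant 17, and most runs will simply miss the window where y == 0 is visible.

```java
import java.lang.invoke.VarHandle;

public class ReorderLitmus {
    // Plain (non-volatile) fields on purpose: the test needs a data race.
    static int x;
    static int y;

    // Returns what Thread 2 would print, or -1 if it never saw y == 0.
    static int runOnce() {
        x = 0;
        y = -1; // assumed nonzero initial value, so y == 0 implies Thread 1 stored it
        int[] printed = {-1};

        // Thread 1: reconstruction of the code from the question (an assumption).
        Thread t1 = new Thread(() -> {
            x = 1;
            VarHandle.fullFence();
            x = 0;
            y = x;
            VarHandle.fullFence();
        });
        // Thread 2: the modified observer from this answer, with no synchronization.
        Thread t2 = new Thread(() -> {
            if (y == 0) {
                x = 17;
                printed[0] = x; // stands in for System.out.println(x)
            }
        });

        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return printed[0];
    }

    // Counts outcomes: [0] = y == 0 not seen, [1] = printed 0, [2] = anything else (expected 17).
    static int[] tally(int runs) {
        int[] counts = new int[3];
        for (int i = 0; i < runs; i++) {
            int r = runOnce();
            counts[r == -1 ? 0 : (r == 0 ? 1 : 2)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = tally(2_000);
        System.out.printf("not seen: %d, printed 0: %d, printed 17: %d%n", c[0], c[1], c[2]);
    }
}
```

A nonzero count in the "printed 0" bucket would be the reordering described above. A zero count proves nothing either way.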
A simple hardware mechanism that could cause this is a store buffer. When Thread 1 stores 0 to x, perhaps the store does not go to the coherent cache right away, but is put in a store buffer while Core 1 waits to take ownership of that cache line. When it then loads from x on the next line, it fulfills the load from its store buffer and sees the value 0. Meanwhile, Thread 2, having already seen y == 0, stores 17 to x, and its store hits the cache. Later, Core 1 finally gets ownership of the cache line and empties the store buffer, storing 0 to x in cache. After that, Core 2 loads x again, and now gets the new value of 0 from cache.
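The sequence of events above can be replayed deterministically with a toy model. This is only a sketch: the Core class, the string-keyed "cache", and the initial value of -1 for y are all invented for illustration, and it assumes the store to y reaches the cache before the buffered store to x drains, which is the interleaving this scenario needs.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class StoreBufferSketch {
    // Toy stand-in for the coherent cache shared by both cores.
    static final Map<String, Integer> cache = new HashMap<>();

    // Toy model of a core with a private store buffer (hypothetical, not real hardware).
    static final class Core {
        final Map<String, Integer> storeBuffer = new LinkedHashMap<>();

        // The store waits in the buffer; other cores cannot see it yet.
        void bufferedStore(String var, int value) { storeBuffer.put(var, value); }

        // The store goes straight to the coherent cache.
        void directStore(String var, int value) { cache.put(var, value); }

        // Loads check the core's own store buffer first (store forwarding).
        int load(String var) {
            Integer pending = storeBuffer.get(var);
            return pending != null ? pending : cache.get(var);
        }

        // The buffered store finally commits to the cache.
        void drain(String var) { cache.put(var, storeBuffer.remove(var)); }
    }

    // Replays the interleaving from the text; returns what Thread 2 prints.
    static int replay() {
        cache.clear();
        cache.put("x", 1);  // state after x = 1 and the first fence
        cache.put("y", -1); // assumed nonzero initial value of y
        Core core1 = new Core();
        Core core2 = new Core();

        core1.bufferedStore("x", 0);    // x = 0 sits in Core 1's store buffer
        int loaded = core1.load("x");   // forwarded from the buffer, so 0
        core1.directStore("y", loaded); // y = 0 becomes visible first

        int printed = -1;
        if (core2.load("y") == 0) {     // Thread 2 sees y == 0
            core2.directStore("x", 17); // its store hits the cache
            core1.drain("x");           // Core 1 now commits x = 0 over it
            printed = core2.load("x");  // Core 2 reloads x from cache
        }
        return printed;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints 0
    }
}
```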
The compiler could also reorder the code to get the same effect. It would be free to rewrite the code as:
x = 1;
fullFence();
y = 0;
x = 0;
fullFence();
This is okay, because the only way for this code to behave differently from the original would be if some other thread were concurrently writing to x. And that would be a data race, so the compiler does not have to care what the other thread sees in such a case.
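As a quick sanity check of that claim, the two versions can be run side by side in a single thread. The original() method is my reconstruction of the question's code and the class name is made up. With no other thread touching x, both leave the same values behind.

```java
import java.lang.invoke.VarHandle;
import java.util.Arrays;

public class RewriteEquivalence {
    static int x;
    static int y;

    // Reconstruction of the code from the question (an assumption).
    static int[] original() {
        x = 1;
        VarHandle.fullFence();
        x = 0;
        y = x;
        VarHandle.fullFence();
        return new int[] {x, y};
    }

    // The rewritten form shown above.
    static int[] rewritten() {
        x = 1;
        VarHandle.fullFence();
        y = 0;
        x = 0;
        VarHandle.fullFence();
        return new int[] {x, y};
    }

    public static void main(String[] args) {
        // Single-threaded, so both versions end with x == 0 and y == 0.
        System.out.println(Arrays.equals(original(), rewritten())); // prints true
    }
}
```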