But the CPU can execute instructions out of order, and in fact it is possible for the program to execute in that order:
Out-of-order execution is different from reordering of when loads / stores become globally visible. OoOE preserves the illusion of your program running in order. Memory reordering is possible without OoOE: even an in-order pipelined core will want to buffer its stores. See parts of this answer, for example.
If I am right, is it possible to force to get output equals to 0?
Not on x86, which only does StoreLoad reordering, not StoreStore reordering. If the compiler reorders the stores to x and f at compile time, then you will sometimes see x==0 after seeing f==1. Otherwise you will never see that.
A short sleep after spawning thread1 before spawning thread2 would also make sure thread1 was spinning on x before you modify it. Then you don't need thread2, and can actually do the stores from the main thread.
Have a look at Jeff Preshing's Memory Reordering Caught In The Act for a real program that does observe run-time memory reordering on x86, once per ~6k iterations on a Nehalem.
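As a rough idea of what such a test looks like, here is a minimal sketch of a StoreLoad litmus test. It is not Preshing's program: the names `LitmusResult` and `store_load_litmus` are made up for illustration, and it synchronizes rounds with atomic counters and `yield()` instead of his semaphores and random delays. Relaxed atomics leave the load free to become visible before the earlier store, so a round where both threads read 0 counts as an observed reordering. How often that happens (if at all) depends on the hardware and on timing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical result type for this sketch (not from Preshing's program).
struct LitmusResult {
    int iterations;  // rounds actually run
    int reordered;   // rounds where both threads loaded 0
};

// Each round: thread a does X=1; r1=Y and thread b does Y=1; r2=X.
// r1 == 0 && r2 == 0 is impossible under sequential consistency, so
// seeing it means a load took effect before the preceding store was visible.
LitmusResult store_load_litmus(int iterations) {
    assert(iterations >= 0);
    std::atomic<int> X{0}, Y{0};
    std::atomic<int> round{0};     // number of the round the workers may run
    std::atomic<int> finished{0};  // workers done with the current round
    int r1 = 0, r2 = 0;
    int reordered = 0;

    auto worker = [&](std::atomic<int>& mine, std::atomic<int>& other, int& r) {
        for (int i = 1; i <= iterations; ++i) {
            // Wait until the main thread opens round i.
            while (round.load(std::memory_order_acquire) != i) {
                std::this_thread::yield();
            }
            mine.store(1, std::memory_order_relaxed);
            r = other.load(std::memory_order_relaxed);
            // Publish r to the main thread.
            finished.fetch_add(1, std::memory_order_acq_rel);
        }
    };
    std::thread a([&] { worker(X, Y, r1); });
    std::thread b([&] { worker(Y, X, r2); });

    for (int i = 1; i <= iterations; ++i) {
        X.store(0, std::memory_order_relaxed);
        Y.store(0, std::memory_order_relaxed);
        finished.store(0, std::memory_order_relaxed);
        round.store(i, std::memory_order_release);  // start the round
        while (finished.load(std::memory_order_acquire) != 2) {
            std::this_thread::yield();
        }
        if (r1 == 0 && r2 == 0) ++reordered;
    }
    a.join();
    b.join();
    return {iterations, reordered};
}
```

Calling `store_load_litmus(1000000)` and printing `reordered` gives a count that varies from run to run. Replacing the `yield()` calls with pure spinning makes the two workers start closer together, which tends to make the effect easier to see, at the cost of burning CPU time.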
On a weakly-ordered architecture, you could maybe see StoreStore reordering at run-time with something like your test program. But you'd likely have to arrange for the variables to be in different cache lines! And you'd need to test in a loop, not just once per program invocation.
How can I make this code safe? I know about mutexes/semaphores and I could protect f with a mutex, but I have heard something about memory fences; please tell me more.
Use C++11 std::atomic to get acquire/release semantics on your accesses to f.
// needs <atomic>, <cstdint>, <iostream>, <thread>
std::atomic<uint32_t> f{0}; // flag to indicate when x is ready
uint32_t x;
...
// don't use new when a local with automatic storage works fine
std::thread t1 = std::thread([&f, &x](){
    while (f.load(std::memory_order_acquire) == 0) {}  // spin until x is published
    std::cout << x << std::endl; });
// or sleep a few ms, and do t2's work in the main thread
std::thread t2 = std::thread([&f, &x](){
    x = 42; f.store(1, std::memory_order_release); });
t1.join(); t2.join(); // a joinable std::thread must be joined (or detached) before it is destroyed
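For something you can compile and test directly, here is a self-contained sketch of the same pattern. The wrapper function `publish_and_read` is a name invented for this example: it runs the writer/reader pair once and returns the value of x that the reader saw, instead of printing it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical helper (not from the question): runs the writer/reader pair
// once and returns the value of x that the reader observed.
std::uint32_t publish_and_read() {
    std::atomic<std::uint32_t> f{0};  // flag: becomes 1 once x is ready
    std::uint32_t x = 0;              // plain payload, published via f
    std::uint32_t seen = 0;

    std::thread reader([&] {
        // Acquire load: once this reads 1, the writer's earlier store
        // to x is guaranteed to be visible in this thread.
        while (f.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        seen = x;
    });
    std::thread writer([&] {
        x = 42;
        // Release store: cannot become visible before the store to x.
        f.store(1, std::memory_order_release);
    });

    reader.join();
    writer.join();
    assert(f.load() == 1);
    return seen;  // join() synchronizes, so reading seen here is race-free
}
```

With the acquire/release pair in place, `publish_and_read()` can only return 42. If both accesses to f were relaxed, or f were a plain variable, returning 0 would be allowed (and the plain version would be a data race).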
The default memory ordering for something like f = 1 is std::memory_order_seq_cst, which for a store requires an MFENCE (or a locked instruction such as XCHG) on x86, or an equivalent expensive barrier on other architectures.
On x86, the weaker memory orderings (acquire and release) just prevent compile-time reordering, and don't require any barrier instructions.
std::atomic also prevents the compiler from hoisting the load of f out of the while loop in thread1, like @Baum's comment describes. (Because atomic has semantics like volatile, where it's assumed that the stored value can change asynchronously. For a non-atomic variable, data races are undefined behaviour, so the compiler normally can hoist loads out of loops, unless alias analysis fails to prove that stores through pointers inside the loop can't modify the value.)
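To make the hoisting point concrete, here is a small sketch. The comment shows the transformation a compiler is allowed to make for a plain flag; the code shows the atomic version, which has to re-load the flag on every iteration. The function names `wait_for_flag` and `waiter_sees_store` are made up for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// With a plain `uint32_t f`, the compiler may treat
//     while (f == 0) {}
// as if it were
//     if (f == 0) { for (;;) {} }
// because nothing inside the loop writes f, and a write from another
// thread would be a data race that it is allowed to assume never happens.

// With std::atomic, every iteration performs a real load, so a store from
// another thread is eventually noticed. Returns the number of loads done.
unsigned long wait_for_flag(const std::atomic<std::uint32_t>& flag) {
    unsigned long polls = 1;
    while (flag.load(std::memory_order_acquire) == 0) {
        ++polls;
        std::this_thread::yield();
    }
    return polls;
}

// Starts a waiter, then sets the flag from the calling thread.
// Returns true once the waiter has seen the store and exited its loop.
bool waiter_sees_store() {
    std::atomic<std::uint32_t> flag{0};
    unsigned long polls = 0;
    std::thread waiter([&] { polls = wait_for_flag(flag); });
    flag.store(1, std::memory_order_release);
    waiter.join();
    assert(polls >= 1);
    return polls >= 1;
}
```

`waiter_sees_store()` always returns, because the waiter's loop cannot be turned into an unconditional infinite loop. The same program with a plain flag might appear to work in a debug build and hang once optimizations are enabled.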