If `i` and `j` are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.
But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.
So for example, if `int i;` is a local variable, after `sscanf("0", "%d", &i);` its address will have escaped the function and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.
See my answer on *Understanding volatile asm vs volatile variable*, with an example of `asm volatile("" ::: "memory")` being a barrier for a local variable whose address escaped the function (`sscanf("0", "%d", &i);`), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.
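For a concrete picture, here is a sketch of the escaped vs. purely-local case (GNU C syntax for the `asm` statement; `observe` is a hypothetical non-inline function defined elsewhere):

```c
#include <stdio.h>

int observe(void);            /* hypothetical: defined in another translation unit */

int demo(void)
{
    int i = 0, j = 40;
    sscanf("0", "%d", &i);    /* &i escapes: i now has to be in sync in memory
                                 around any later non-inline call */
    i += 1;
    j += 2;                   /* &j never escapes: j can stay in a register */

    asm volatile("" ::: "memory");  /* compiler barrier: forces the store of i
                                       to memory here, but j is unaffected */
    observe();                /* opaque call: same effect, spill/reload i only */
    return i + j;
}
```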
> I assume that the above quote is talking about CPU reordering and not about compiler reordering.
It's talking about both, because both are necessary for correctness.
This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)
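For example (names hypothetical), with a global variable the compiler has to emit the stores around an opaque call in source order:

```c
int shared;                  /* global: reachable by code the compiler can't see */
void opaque_call(void);      /* hypothetical: not inlinable, body unknown */

void update(void)
{
    shared = 1;              /* must be in memory before the call: the callee might read it */
    opaque_call();
    shared = 2;              /* can't be hoisted above the call: the callee might
                                read or write shared itself */
}
```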
`pthread_mutex_lock` being a non-inline function call takes care of compile-time reordering, and the fact that it does a `lock`ed operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the `call` instruction itself, though, just the code in the function body.) This gives it acquire semantics.
Unlocking a spinlock only needs a release-store, not an RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still ok: it keeps everything in the critical section from getting out. It's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics.)
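To make that asymmetry concrete, here is a toy spinlock sketch in C11 atomics (not how a real pthread mutex is implemented): taking the lock is an atomic RMW with acquire ordering, releasing it is just a release-store.

```c
#include <stdatomic.h>

atomic_flag lock_word = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Atomic RMW with acquire ordering: on x86 this compiles to an xchg or
       lock-prefixed instruction, which is also a full StoreLoad barrier. */
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

void spin_unlock(void)
{
    /* Plain release-store, no RMW: stops anything in the critical section from
       becoming visible after it, but is not a StoreLoad barrier. */
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}
```

Stores like `i = 10; j = 20;` between `spin_lock()` and `spin_unlock()` are then confined to the critical section the same way the pthread functions confine them.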
On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM `dmb` (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.
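You can see the same kind of barrier from portable code: a release fence (standing in for the barrier a mutex-unlock would run) typically compiles to `dmb ish` on ARM, but to no instruction at all on x86. A sketch with assumed names:

```c
#include <stdatomic.h>

int payload;
atomic_int flag;

void publish(int v)          /* hypothetical publisher, just to give the fence context */
{
    payload = v;
    atomic_thread_fence(memory_order_release);   /* ARM: dmb ish; x86: no instruction,
                                                    only a compile-time barrier */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}
```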
> Now what prevents the CPU from reordering `mov 10 into i` and `mov 20 into j` to above `call pthread_mutex_lock()`?
This isn't the important reason (because on a weakly-ordered ISA `pthread_mutex_unlock` would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the `call` instruction, let alone with the actual locking/unlocking of the mutex done by the function body before the function returns.
x86 has strong memory-ordering semantics (stores don't reorder with other stores), and `call` is a store (pushing the return address).
So `mov [i], 10` must appear in the global store order between the stores done by the two `call` instructions (the return-address pushes for the calls to `pthread_mutex_lock` and `pthread_mutex_unlock`).
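Putting the question's scenario into source form (mutex and variable names assumed), with comments noting where the `call` instructions' stores fall:

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int i, j;

void critical_section(void)
{
    pthread_mutex_lock(&m);    /* the call instruction pushes a return address: a store  */
    i = 10;                    /* on x86 stores become globally visible in program order, */
    j = 20;                    /*   so these can't appear before that push...             */
    pthread_mutex_unlock(&m);  /* ...nor after the return-address push of this call       */
}
```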
Of course in a normal program, nobody is observing the call stack of other threads, just the `xchg` to take the mutex or the release-store to release it in `pthread_mutex_unlock`.