
As far as I know, a function call acts as a compiler barrier, but not as a CPU barrier.

This tutorial says the following:

acquiring a lock implies acquire semantics, while releasing a lock implies release semantics! All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries.

I assume that the above quote is talking about CPU reordering and not about compiler reordering.

But I don't understand how a mutex lock and unlock cause the CPU to give these functions acquire and release semantics.

For example, if we have the following C code:

pthread_mutex_lock(&lock);
i = 10;
j = 20;
pthread_mutex_unlock(&lock);

The above C code is translated into the following (pseudo) assembly instructions:

push the address of lock onto the stack
call pthread_mutex_lock()
mov 10 into i
mov 20 into j
push the address of lock onto the stack
call pthread_mutex_unlock()

Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock() or to below call pthread_mutex_unlock()?

If it is the call instruction that prevents the CPU from doing the reordering, then why does the tutorial I quoted make it seem like it is the mutex lock and unlock functions that prevent the CPU reordering? Why didn't the tutorial say that any function call will prevent the CPU reordering?

My question is about the x86 architecture.

curiousguy
user8426277
  • Why didn't you undelete and edit your previous version of this: https://stackoverflow.com/questions/50948788/how-does-a-mutex-lock-and-unlock-accomplishes-acquire-and-release-semantics? This is the same question with minor edits, and Martin's comment still applies. – Peter Cordes Jun 20 '18 at 14:52
  • "I assume that the above quote is talking about CPU reordering and not about compiler reordering." Certainly, locking/unlocking a pthread mutex MUST ensure the compiler does not re-order the instructions it is protecting. So the quote is talking about both compiler and CPU reordering - they're equally important from the point of view of a pthread mutex. – nos Jun 20 '18 at 15:23
  • It is the *implementations* of `pthread_mutex_lock()` and `pthread_mutex_unlock()` that realize their promises about runtime ordering. CPUs that perform such reordering also have instructions for modulating it, and the mutex lock / unlock functions use these (among other things). – John Bollinger Jun 20 '18 at 16:26
  • @JohnBollinger: I already mentioned this in my answer. I highlighted it since it's an important point about why the mutex functions are special. – Peter Cordes Jun 20 '18 at 18:05
  • And I already upvoted your answer, @PeterCordes, and I declined to write one of my own. But your answer is so full of information that even with the highlighting, it is easy to overlook this point -- which I think is the crux -- within. – John Bollinger Jun 20 '18 at 18:23
  • @JohnBollinger: That's fair, it's definitely a key point. I started writing before reading the question super carefully, so I probably answered some stuff that the question doesn't ask explicitly. Seemed like a good idea to cover a bunch of related ground. (And I started with compiler reordering because I'd just been looking at another Q&A about that. :P) – Peter Cordes Jun 20 '18 at 18:28
  • I wrote my answer because although @PeterCordes answer is great I thought this question also deserved one "little" answer just focusing on the narrow thing that I thought the OP was asking: how does the compiler/runtime/implementation prevent CPU reordering (not compiler reordering). The answer for that is simple: the implementation includes barriers which prevent CPU reordering. The OP commented on my answer which reveals their original misunderstanding: that reordering could occur around `call` as if it were any other instruction. It can't. Reordering happens against the "dynamic trace". – BeeOnRope Jun 20 '18 at 19:25
  • I am also a newbie on processor reordering, so although I'm undoubtedly unable to give a better answer than the experts here, I have a better understanding of what the OP was asking, because I had the same puzzlement when reading this paragraph. He was not asking how `pthread_mutex_lock/pthread_mutex_unlock` implements a CPU barrier, but why these mutex functions are obliged to implement said CPU barrier from the meaning of mutex. Below I will give my answer to the first question ("Now what prevents ..."), in a newbie-for-newbie way, ... – zzzhhh Jun 12 '21 at 06:41
  • ... first of all, generally speaking, `pthread_mutex_lock()` is nothing more than a sequence of instructions, so generally speaking again, the CPU has no idea that it is so special a function that processor memory reordering is prohibited. So, generally speaking a third time, nothing can prevent the CPU from reordering `mov 10 into i` to above `call pthread_mutex_lock()`. As a result, it is natural to ask this question. Now let me answer below, starting with the meaning of mutex. ... – zzzhhh Jun 12 '21 at 06:42
  • ... We all know the meaning of mutex is to specify the critical section, as written in all OS/parallel-programming textbooks. A critical section is defined to be executed by at most one thread. So, if code like `mov 10 into i` is reordered above `call pthread_mutex_lock()` in thread 1, thread 2 might be running the critical section at that point because thread 1 has not yet acquired the mutex, which is a violation of the meaning of mutex. You can find a lot of examples of disasters caused by running critical-section code in two threads simultaneously. ... – zzzhhh Jun 12 '21 at 06:43
  • ... So, boiling down, it is the meaning of mutex that prevents memory reordering. This explains why `pthread_mutex_lock()` has to serve as a memory barrier. Of course, simply naming a function mutex_lock or something like that does not mean it will function as a mutex acquisition; we have to implement it. In fact, we have to implement not only the traditional mutex acquisition which is written in many textbooks, but also the acquire semantics. As is said in the next paragraph of the Preshing article, "Every implementation of a lock, ..., should provide these guarantees." ... – zzzhhh Jun 12 '21 at 06:44
  • ... So, from a root viewpoint, a memory barrier is put in `call pthread_mutex_lock()` solely out of conscientiousness and respect for the semantics of mutex. In practice, it is likely that a programmer failed to write a CPU memory barrier in the function but added it later as a bug fix, or happened to use some instructions that automatically implement the barrier without the programmer even knowing it. These implementation details were described in great detail by the expert answers, so I will not repeat them. The same argument can be applied to `pthread_mutex_unlock()` to observe release semantics. – zzzhhh Jun 12 '21 at 06:46

2 Answers


The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it. The instruction flow will move from the calling code into the lock and unlock functions via a call instruction, and it is this dynamic instruction trace you have to consider for the purposes of reordering - not the static sequence you see in an assembly listing.

On x86 specifically, you probably won't find explicit, standalone memory barriers inside those functions, since you'll already have lock-prefixed instructions in order to perform the actual locking and unlocking atomically, and these instructions imply a full memory barrier, which prevents the CPU reordering you are concerned about.

For example, on my Ubuntu 16.04 system with glibc 2.23, pthread_mutex_lock is implemented using a lock cmpxchg (compare-and-exchange) and pthread_mutex_unlock is implemented using lock dec (decrement), both of which have full barrier semantics.
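To make that concrete, here is a hypothetical, minimal spinlock sketched with C11 atomics - not glibc's actual pthread_mutex implementation, which also handles futexes, contention, and error checking. The lock side is an atomic RMW with acquire semantics (a lock-prefixed instruction on x86), and the unlock side is a release store:

```c
// Hypothetical sketch only, not glibc's real pthread_mutex code.
#include <stdatomic.h>

static atomic_int lock_word;   // 0 = free, 1 = held
static int i, j;               // the "protected" data from the question

void spin_lock(atomic_int *l) {
    // Atomic RMW with acquire semantics. On x86 this compiles to a
    // lock-prefixed xchg, which is also a full memory barrier, so the
    // critical section's loads/stores can't be hoisted above it.
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;  // spin until the lock is observed free and taken in one step
}

void spin_unlock(atomic_int *l) {
    // A plain release store: the critical section's loads/stores can't
    // sink below it. No RMW or full barrier is needed here.
    atomic_store_explicit(l, 0, memory_order_release);
}
```

With this sketch, the `i = 10; j = 20;` stores from the question end up between the acquire RMW and the release store in the dynamic instruction trace, which is exactly the "barrier sandwich" the tutorial describes.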

BeeOnRope
  • *"The short answer is that the body of the pthread_mutex_lock and pthread_mutex_unlock calls will include the necessary platform-specific memory barriers which will prevent the CPU from moving memory accesses within the critical section outside of it."* I don't understand why the body of these functions matters, I mean `i = 10;` could be executed before `pthread_mutex_lock(&lock);` is executed, and so what is inside the critical section has been moved to the outside of the critical section. Or is such reordering not allowed on all CPU architectures? – user8426277 Jun 20 '18 at 18:27
  • @user8426277 - because it's the _dynamic_ flow of instructions that (mostly) matters for re-ordering. The CPU doesn't execute the `call` as a single instruction and then continue on to the next instruction in source/assembly order, it goes _into_ the body of the `pthread` function. So for the purpose of analysis, you can imagine the entire body of the `lock` and `unlock` calls sort of "inlined" in the assembly at the place the `call` instruction appears. If that weren't the case, mutexes and many other sync mechanisms would be impossible to implement as plain function calls! – BeeOnRope Jun 20 '18 at 19:15
  • @user8426277 - I updated my answer to make it clearer. The bottom line is that, at the assembly level, you can't talk about what types of reordering are possible across a `call`, `jmp` or any other control flow change because you need to know what happens at the location jumped to. – BeeOnRope Jun 20 '18 at 19:17
  • x86 `call` does a release-store (pushing the return address), so if you want to be pedantic the `call` itself *does* actually have an effect. But no moreso than any normal stores done inside the function. But yes, very good update, I think that might be getting at the heart of what the OP is missing. – Peter Cordes Jun 20 '18 at 19:24
  • @PeterCordes I never said that `call` doesn't have a release effect, did I? I'm saying you need to know what is inside the `call` to know the entire effect of `call` on the instruction trace. Of course, the `call` _itself_ may have some ordering semantics, such as those associated with the store to the stack, but that is practically unusable and is IMO a total red herring (and another answer covers it well :-). Note that release-acquire talks about a location: the `lock` part of a mutex doesn't synchronize with a `push` of the return address on another thread, so that release isn't very useful. – BeeOnRope Jun 20 '18 at 19:26
  • I just meant that in reply to *you can't talk about what types of reordering are possible across a `call`...*" in your 2nd comment. Not suggesting you overcomplicate this answer with that, because I already did that in mine :P Agreed that it's a total red herring, and the memory-ordering part of `call` is totally unimportant because other threads aren't looking at your call stack. – Peter Cordes Jun 20 '18 at 19:30
  • @PeterCordes - well yes, I wondered if someone would come along and say something like that, and I half-considered making it more precise, like noting that the `call` itself may prevent some reorderings, but the body may then prevent _more_ reordering. That said, I actually don't think it's meaningful to say that `call` has "release semantics" (with the implication perhaps being that you don't need a store inside the `unlock` implementation): _release_ is meaningful when talking about a particular _location_ and in the case of `call` the location is an anonymous location on the stack. – BeeOnRope Jun 20 '18 at 19:33
  • Ah right, that's a good analysis of exactly why it's not meaningful or useful. It doesn't matter where `call`'s store appears in the global order, just stores outside the critical section vs. the mutex lock vs. inside. I knew I was being pedantic, but hadn't thought through exactly how useless that was. If we're talking about the non-store part of instructions, then it's another out-of-order exec vs. memory ordering question. – Peter Cordes Jun 20 '18 at 19:34
  • ... because in what sense can the acquire side (the `lock` call) synchronize with the store implied by `call`? I don't think it can. I can't see a way you could rely on the so-called "release" semantics of the store. [<-- this part of my comment raced with your last reply]. So we should _not_ think of release stores as some type of memory barrier really: putting a "dummy" store with "release" semantics doesn't do anything meaningful. It's only useful to something that consumes its value, to establish a type of happens-before relationship or however you want to think of it. – BeeOnRope Jun 20 '18 at 19:36
  • @BeeOnRope *"The CPU doesn't execute the call as a single instruction and then continue on to the next instruction in source/assembly order, it goes into the body of the pthread function"* I didn't mean in my question in the comments that I thought that the CPU will execute `pthread_mutex_lock(&lock);` and then immediately afterwards the CPU will execute `i = 10;` without going to the function body first. What I meant was: what if the CPU completely ignored `pthread_mutex_lock(&lock);` and instead executed `i = 10;` first, and then after that, the CPU executed `pthread_mutex_lock(&lock);`. – user8426277 Jun 21 '18 at 22:03
  • @user8426277 - right, I understood. What I am saying though is that the CPU doesn't reorder at the assembly-level listing. You are imagining assembly like `call mutex_lock; store 10 to i;` and asking "what happens if the CPU reorders the store before the `call`?" The answer is that wasn't the right way to look at it in the first place: the CPU doesn't reorder against the static assembly you find on disk in the binary, it operates against the _dynamic_ stream of instructions, so you need to mentally trace out the execution and record that stream. – BeeOnRope Jun 22 '18 at 17:34
  • In this case the result is simple: you just end up with the instructions that are part of the body of the `lock` and `unlock` functions replacing the call instruction. _Then_ you can talk about reordering against that list of instructions - that's what I call the _dynamic trace_. – BeeOnRope Jun 22 '18 at 17:35
  • Oh! In that case, is the purpose of the memory barrier inside the `pthread_mutex_lock(&lock);` function to prevent the reordering of `i = 10;` to above the code that sets/locks the mutex variable inside the `pthread_mutex_lock(&lock);` function? – user8426277 Jun 22 '18 at 19:45
  • @user - correct. As a consequence it also prevents it from moving above the lock function entirely. – BeeOnRope Jun 22 '18 at 19:47

If `i` and `j` are local variables, nothing. The compiler can keep them in registers across the function call if it can prove that nothing outside the current function has their address.

But any global variables, or locals whose address might be stored in a global, do have to be "in sync" in memory for a non-inline function call. The compiler has to assume that any function call it can't inline modifies any / every variable it can possibly have a reference to.

So for example, if `int i;` is a local variable, after `sscanf("0", "%d", &i);` its address will have escaped the function and the compiler will then have to spill/reload it around function calls instead of keeping it in a call-preserved register.

See my answer on Understanding volatile asm vs volatile variable, with an example of asm volatile("":::"memory") being a barrier for a local variable whose address escaped the function (sscanf("0", "%d", &i);), but not for locals that are still purely local. It's exactly the same behaviour for exactly the same reason.
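A small sketch of that distinction (assuming GCC or Clang, since it uses the GNU-style empty asm statement; the function name is just for illustration):

```c
// Sketch of escape analysis vs. a compiler-only barrier (GCC/Clang).
#include <stdio.h>

int barrier_demo(void) {
    int escaped = 0;  // address passed to sscanf below: it "escapes"
    int local   = 0;  // address never taken: can live in a register

    sscanf("0", "%d", &escaped);  // after this, the compiler must keep
                                  // `escaped` in sync with memory
                                  // around any opaque call or barrier

    escaped = 10;
    local   = 20;

    // Compiler-only barrier (no CPU barrier instruction emitted):
    // forces the store to `escaped` to be in memory at this point,
    // but `local` may legally stay in a register because no outside
    // code can possibly observe it.
    __asm__ __volatile__("" ::: "memory");

    return escaped + local;
}
```

The barrier orders the escaped variable's memory access at compile time only; it says nothing about what the CPU does at run time, which is why mutex implementations on weakly-ordered ISAs need real barrier instructions on top of this.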


I assume that the above quote is talking about CPU reordering and not about compiler reordering.

It's talking about both, because both are necessary for correctness.

This is why the compiler can't reorder updates to shared variables with any function call. (This is very important: the weak C11 memory model allows lots of compile-time reordering. The strong x86 memory model only allows StoreLoad reordering, and local store-forwarding.)

pthread_mutex_lock being a non-inline function call takes care of compile-time reordering, and the fact that it does a locked operation, an atomic RMW, also means it includes a full runtime memory barrier on x86. (Not the call instruction itself, though, just the code in the function body.) This gives it acquire semantics.

Unlocking a spinlock only needs a release-store, not an RMW, so depending on the implementation details the unlock function might not be a StoreLoad barrier. (This is still ok: it keeps everything in the critical section from getting out. It's not necessary to stop later operations from appearing before the unlock. See Jeff Preshing's article explaining Acquire and Release semantics.)
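The acquire/release pairing Preshing describes can be sketched (hypothetically, outside any mutex) as a publish/consume pattern in C11 atomics; the release store here plays the same role as the unlock's store:

```c
// Hypothetical sketch of release/acquire pairing with C11 atomics.
#include <stdatomic.h>

static int payload;            // non-atomic data being published
static atomic_int ready;       // flag guarding the payload

void publish(int v) {
    payload = v;  // must become visible before `ready` flips...
    // ...which the release store guarantees: it cannot be reordered
    // before the payload store. On x86 it compiles to a plain mov,
    // because x86 stores are already ordered with each other.
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consume(void) {
    // The acquire load pairs with the release store: once it reads 1,
    // the payload store is guaranteed to be visible to this thread.
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  // spin (a real consumer would block or back off)
    return payload;
}
```

Note that nothing stops operations *after* `publish` returns from being reordered before the release store; release only keeps the preceding accesses from sinking below it, which is exactly the one-way property the unlock needs.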

On a weakly-ordered ISA, those mutex functions would run barrier instructions, like ARM dmb (data memory barrier). Normal functions wouldn't, so the author of that guide is correct to point out that those functions are special.


Now what prevents the CPU from reordering mov 10 into i and mov 20 into j to above call pthread_mutex_lock()

This isn't the important reason (because on a weakly-ordered ISA pthread_mutex_unlock would run a barrier instruction), but it is actually true on x86 that the stores can't even be reordered with the call instruction, let alone with the actual locking/unlocking of the mutex done by the function body before the function returns.

x86 has strong memory-ordering semantics (stores don't reorder with other stores), and call is a store (pushing the return address).

So mov [i], 10 must appear in the global store order between the stores done by the two call instructions.

Of course in a normal program, nobody is observing the call stack of other threads, just the xchg to take the mutex or the release-store to release it in pthread_mutex_unlock.

Peter Cordes
  • Being a non-inline function call is not necessarily enough to prevent compile-time reordering across the call. For example, if a translation unit has file-scope static variables, and the compiler can prove that their addresses are not accessible to the call, then the compiler is free to reorder accesses to those variables, across the call, even though those variables could be accessed by more than one thread. – Arch D. Robison Jun 21 '18 at 23:38
  • @ArchD.Robison: A compiler could only prove that if the only functions that touch the `static` var are themselves `static`, and not called from any non-`static` functions and don't have the function address passed to any black-box functions. Otherwise it has to assume that the unknown function can call back into this translation unit to eventually reach a function that reads or writes the static var. Thread-creation and signal-handler functions are a black-box, so there's no way to start another thread running a `static` function without having static var access "escape" the translation unit. – Peter Cordes Jun 21 '18 at 23:47
  • @PeterCordes If you have a set of variables V only manipulated by functions F1 that are only called by the set of functions F2 and none of these can be named outside the TU, and no other function inside the TU takes the address to put it in an object accessible elsewhere... at the end of the day, V, F1, F2... serve no purpose at all. – curiousguy Jun 24 '18 at 03:02
  • @ArchD.Robison Presumably, in a realistic program static variables can be indirectly **accessed from main**. – curiousguy Jun 24 '18 at 03:08
  • @curiousguy: Good point. Unless this is the TU containing `main`, then maybe the compiler could prove that those variables can only be used by the main thread (if it optimized based on it being UB to call `main` again from inside the program). I don't think there are any cases where a non-inline function call can fail to order any variables that are actually shared. – Peter Cordes Jun 24 '18 at 13:22
  • I concur with @PeterCordes' analysis. I had forgotten to consider the "call back in" possibility. – Arch D. Robison Jun 24 '18 at 23:40
  • "_if it can prove that nothing outside the current function have their address._" Is there a name for that? Function private variables? – curiousguy Nov 04 '19 at 04:41
  • @curiousguy: https://en.wikipedia.org/wiki/Escape_analysis is the name for the checking. I'm not aware of a widely-established name for variables that meet the criteria. I'd maybe say "variable that hasn't escaped". – Peter Cordes Nov 04 '19 at 04:44