
I'm writing some code on Linux using the pthread multithreading library, and I'm currently wondering whether the following code is safe when compiled with -Ofast -flto -pthread.

// shared global
long shared_event_count = 0;
// ...
pthread_mutex_lock(mutex);
while (shared_event_count <= *last_seen_event_count)
    pthread_cond_wait(cond, mutex);
*last_seen_event_count = shared_event_count;
pthread_mutex_unlock(mutex);

Are the calls to pthread_* functions enough, or should I also include a memory barrier to make sure that the change to the global variable shared_event_count is actually seen during the loop? Without a memory barrier, the compiler would be free to keep the variable in a register only, right? Of course, I could declare the shared integer as volatile, which would prevent keeping its contents in a register during the loop. But if I used the variable multiple times within the loop, it could make sense to fetch a fresh value only for the loop condition, because that could allow more compiler optimizations.

From testing the above code as-is, it appears that the generated code actually sees the changes made by another thread. However, is there any spec or documentation that actually guarantees this?

The common solution seems to be "don't optimize multithreaded code too aggressively", but that seems like a poor man's workaround instead of really fixing the issue. I'd rather write correct code and let the compiler optimize as much as possible within the specs. Any code that gets broken by optimization is in reality relying on, e.g., undefined behavior of the C standard as if it were stable behavior, except for some rare cases where the compiler actually outputs invalid code, but that seems to be very rare these days.

I'd much prefer writing the code that works with any optimizing compiler – as such, it should only use features specified in the C standard and the pthread library documentation.

I found an interesting article at https://www.alibabacloud.com/blog/597460 which contains a trick like this:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

This was first used in the Linux kernel, where it triggered a compiler bug in old GCC versions: https://lwn.net/Articles/624126/

As such, let's assume that the compiler follows the spec, contains no bugs, and implements every possible optimization known to man that the specs allow. Is the above code safe under that assumption?

Also, does pthread_mutex_lock() include a memory barrier by the spec, or could the compiler re-order statements around it?

Mikko Rantalainen
  • Does POSIX 1-2008's requirement that [_"[`pthread_mutex_lock` shall \] synchronize memory with respect to other threads"_](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12) qualify as "by the spec"? – pilcrow Jul 20 '22 at 14:37
  • [Ref1](https://stackoverflow.com/q/24137964/589924): "Any function whose definition is not available in this translation unit (and is not intrinsic) is a *compiler* memory barrier." (This is surely from the C spec?) It goes on: "pthread_mutex_lock, pthread_mutex_unlock, pthread_create also issue a hardware memory barrier to prevent the CPU from reordering reads and writes." No spec mentioned, but it's marked as a duplicate of [Ref2](https://stackoverflow.com/q/3208060/589924), which refers to the POSIX spec. Dup? – ikegami Jul 20 '22 at 14:42
  • How about `-flto`, i.e. Link Time Optimization? Does that affect what is considered an external translation unit? – Mikko Rantalainen Jul 20 '22 at 14:43
  • I interpret the POSIX 1-2008 wording as *the order the threads can access the memory is synchronized* but that doesn't say anything about if the compiler is required to actually write e.g. register contents back to memory locations before calling the `pthread_*` functions. It might be the intent but wording is not clear to me. – Mikko Rantalainen Jul 20 '22 at 14:47
  • Re "*(This is surely from the C spec?)*", Doesn't seem to be. At least not directly. It might be derivable? – ikegami Jul 20 '22 at 14:49
  • And I would consider any POSIX spec as totally valid for the "by the spec". As long as the spec is not only GCC or some other specific implementation, it seems fine for me. – Mikko Rantalainen Jul 20 '22 at 14:50
  • Re "*I interpret the POSIX 1-2008 wording*", No, memory synchronization is the very thing memory barriers aka memory fences provide, by definition. "A synchronization operation without an associated memory location is a fence" (C spec) – ikegami Jul 20 '22 at 14:52
  • Okay, that makes sense. Do I understand correctly that the `volatile` flag for shared state is still required because barrier doesn't guarantee that unless `pthread_*` function is considered external translation unit in all cases (including link time optimization)? – Mikko Rantalainen Jul 20 '22 at 14:57
  • ? no mention of translation units in any of the POSIX stuff we quoted – ikegami Jul 20 '22 at 14:58
  • Also see the discussion here: https://stackoverflow.com/a/2485177/334451 – according to it, the Microsoft C compiler implements `volatile` as a memory barrier, but that's not required by the C standard. – Mikko Rantalainen Jul 20 '22 at 14:58
  • Correct, `volatile` is not a solution if there was a problem, but we established there is no problem. – ikegami Jul 20 '22 at 14:59

1 Answer


The compiler will not reorder memory accesses across pthread_mutex_lock() (this is an oversimplification and not strictly true, see below).

First I’ll justify this by talking about how compilers work, then I’ll justify this by looking at the spec, and then I’ll justify this by talking about convention.

I don’t think I can give you a perfect justification from the spec. In general, I would not expect a spec to give you a perfect justification—it’s turtles all the way down (do you have a spec for how to interpret the spec?), and the spec is designed to be read and understood by actual humans who understand the relevant background concepts.

How This Works

The compiler, by default, assumes that a function it doesn't know about can access any global variable. So it must emit the store to shared_event_count before the call to pthread_mutex_lock(): as far as the compiler knows, pthread_mutex_lock() reads the value of shared_event_count.

Inside pthread_mutex_lock is a memory fence for the CPU, if necessary.

Justification

From n1548:

In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).

Yes, there’s LTO. LTO can do some very surprising things. However, the fact is that writing to shared_event_count does have side effects and those side effects do affect the behavior of pthread_mutex_lock() and pthread_mutex_unlock().

The POSIX spec states that pthread_mutex_lock() provides synchronization. I could not find an explanation in the POSIX spec of what synchronization is, so this may have to suffice.

POSIX 4.12

Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:

Yes, in theory the store to shared_event_count could be moved or eliminated—but the compiler would have to somehow prove that this transformation is legal. There are various ways you could imagine this happening. For example, the compiler might be configured to do “whole program optimization”, and it may observe that shared_event_count is never read by your program—at which point, it’s a dead store and can be eliminated by the compiler.

Convention

This is how pthread_mutex_lock() has been used since the dawn of time. If compilers did this optimization, pretty much everyone’s code would break.

Volatile

I would generally not use this macro in ordinary code:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

This is a weird thing to do, useful in weird situations. Ordinary multithreaded code is not a sufficiently weird situation to use tricks like this. Generally, you want to either use locks or atomics to read or write shared values in a multithreaded program. ACCESS_ONCE does not use a lock and it does not use atomics—so, what purpose would you use it for? Why wouldn’t you use atomic_store() or atomic_load()?

In other words, it is unclear what you would be trying to do with volatile. The volatile keyword is easily the most abused keyword in C. It is rarely useful except when writing to memory-mapped IO registers.

Conclusion

The code is fine. Don’t use volatile.

Dietrich Epp
  • I saw a long conversation appear on this question while I was writing this long answer—makes me worried that someone’s going to jump on me once I post the answer. – Dietrich Epp Jul 20 '22 at 15:12
  • I'm marking this answer as the accepted one at least for now. The only partially missing part is that if LTO could (in my opinion) eliminate the move to `shared_event_count` if it can prove it will not be accessed by the *same thread* (that is, the code it inlined during linking). I know C standard provides very little guarantees about memory ordering or operation ordering between threads so I'm always assuming this would also include LTO but I don't know it well enough to tell for sure. – Mikko Rantalainen Jul 20 '22 at 15:16
  • @MikkoRantalainen: Is there a reason why my citation of the spec is not sufficient for that? What you are saying seems to contradict the part of the C spec that I already quoted. – Dietrich Epp Jul 20 '22 at 15:20
  • I understand that you are looking for justification from the spec, but no spec will explain itself infinitely—at some point, any spec must necessarily rely on terms that are defined elsewhere, and you must bring your own definitions. – Dietrich Epp Jul 20 '22 at 15:20
  • I'm reading the part "An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object)" as if LTO can detect that the called function doesn't use the side-effects of storing the non-volatile `shared_event_count` so it can be kept in register only. Maybe my interpretation is just too pessimistic and this should also include side-effects observable by other threads but I assume it's only the directly called code that's considered. – Mikko Rantalainen Jul 20 '22 at 15:28
  • That said, I think LTO should be able to see the memory barrier inside the called `pthread_*` function which would require storing the `shared_event_count` in RAM if I've understood correctly. – Mikko Rantalainen Jul 20 '22 at 15:29
  • The problem is that the store to `shared_event_count` does, in actual fact, affect the behavior of `pthread_mutex_lock()`. The `pthread_mutex_lock()` synchronizes memory, and if the compiler moved the store to `shared_event_count` across the lock, it would no longer be synchronized by `pthread_mutex_lock()`. I don’t know what this has to do with LTO “seeing the memory barrier”: LTO has to comply with the spec, we don’t actually have to know how LTO works. – Dietrich Epp Jul 20 '22 at 15:41
  • You might have the question, “How does the compiler know that `pthread_mutex_lock()` synchronizes memory?” There are two answers. Either the compiler has special information about how `pthread_mutex_lock()` works, or it does not. If the compiler has no special information about `pthread_mutex_lock()`, then it would have to assume that the function can do *anything,* such as read the value of `shared_event_count` or synchronize memory. If the compiler knows what `pthread_mutex_lock()` does, then it knows the optimization is invalid. – Dietrich Epp Jul 20 '22 at 15:43
  • You seem to be imagining a scenario where, as far as I can tell, the compiler peeks inside the implementation of `pthread_mutex_lock()`, and, in violation of the standard, derives incorrect assumptions about what the function does. This would violate the POSIX spec, which states that `pthread_mutex_lock()` synchronizes memory. On a POSIX system, the C compiler must not only conform to the C spec, but to the POSIX spec as well. – Dietrich Epp Jul 20 '22 at 15:45
  • I think your explanation "Either the compiler has special information about how `pthread_mutex_lock()` works [...] If the compiler has no special information about `pthread_mutex_lock()`, then it would have to assume that the function can do anything, such as read the value of `shared_event_count` or synchronize memory." makes perfect sense. I think it would be worth adding that to the answer in the segment that speaks about LTO. – Mikko Rantalainen Jul 20 '22 at 15:59
  • FYI, see [this question from three days ago](https://stackoverflow.com/questions/73009171/is-volatile-necessary-for-the-resource-used-in-a-critical-section). Note that the threading semantics in C 2018 5.1.2.4 seem to require, in a hosted implementation supporting threads, that all object updates be written to memory before any external call and read (when used) after any external call, except for objects the compiler can strictly deduce do not have multithreading complications, because any external call could contain a call to the standard C lock/release routines… – Eric Postpischil Jul 20 '22 at 23:59
  • … Note 2 in 5.1.2.4 6 states this clearly: “… Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire or consume operation on A…,” but I have not traced that to the normative text that implies it. – Eric Postpischil Jul 21 '22 at 00:00