
When discussing concurrency with my professor, he mentioned that compiler optimizations around locking calls such as pthread_mutex_lock() (re-ordering instructions, optimizing memory accesses, etc.) could cause problems. The reason this doesn’t occur (according to him) is that the compiler has to treat a function call as a black box: it cannot optimize accesses or re-order around the call, because it does not know what the function may do to global state. He said that if you turn on something like link-time optimization (gcc -flto), you may find that your locks suddenly stop working.

He is a fairly reputable source (an old-school GNU guy who developed a lot of the GNU coreutils and has even worked on gcc), but I am left wondering how this can be true when the compiler is able to optimize things like memcpy down to a single instruction, effectively looking across module boundaries without link-time optimization. Could it not then do the same to something like pthread_mutex_lock() and optimize accesses around it?

If this is true and the compiler can peer into the pthread_mutex_lock() function, see that it does not alter a certain variable, and optimize accesses accordingly, could this affect correctness? If so, how can such a core function be left so vulnerable to the possibility of not working correctly? Does this mean some other method has to be employed to tell the compiler not to optimize accesses to these variables, such as the volatile qualifier? This gets even trickier when considering static locks that are module-specific, since a function defined in a different module could not possibly access them.

user129393192
  • That's correct, compilers have to treat non-inline functions as black boxes that could read+write any global. See [How does a mutex lock and unlock functions prevents CPU reordering?](https://stackoverflow.com/q/50951011) If you wrote `pthread_mutex_lock()` as an inline function or compiled libpthread with `-flto`, you'd need to include an `asm("" ::: "memory")` compiler barrier or something. GCC *does* know about some common standard functions like `memcpy`, e.g. treating it as `__builtin_memcpy` by default. [What are \_\_builtin\_\_functions for in C++?](//stackoverflow.com/q/28208637) – Peter Cordes Jul 03 '23 at 18:43

3 Answers

4

The reason this doesn’t occur (according to him) is that the compiler has to treat function calls like a black box and is unable to optimize accesses or re-order around it,

That's one reason, certainly.

Another reason that it might not occur specifically with the GNU toolchain is that GNU's link-time optimization applies only to functions that are compiled with the -flto option, and therefore carry with them additional information on which the optimization relies. All one needs to prevent GCC's link-time optimization from messing with pthread_mutex_lock() is to have a version of that function that does not carry the LTO information. (This ordinarily would be the responsibility of the system, not of the application developer.)

if you turn on something like link-time optimization, gcc -flto, you may find that your locks suddenly stop working.

Possibly. And if you do see that, then it constitutes a flaw in your C implementation, whether you attribute it to compiler, linker, library, or all of the above. This is not to say that you should discount the possibility, but rather that it is something that people working on these tools are out to avoid and / or fix, so the likelihood of running into such an issue goes down over time.

I am left wondering how this can be true, when the compiler is able to optimize things like memcpy to a single instruction, effectively looking across module boundaries without link-time optimization. Could it not then do the same to something like pthread_mutex_lock() and optimize accesses around it?

What you're talking about now is optimization of specific functions. For example, the compiler knows all about memcpy() in particular, based on its specifications, and in some cases based on being part of the same, integrated C implementation. It can optimize (say) some memcpy calls at compile time because it knows what that function is supposed to do, and it recognizes usage idioms that it can optimize without actually looking into the library at all.

A compiler could, in principle, do the same with pthread_mutex_lock(), but this would not be a problem, because such function-specific optimizations are aware of (and rely on) the semantics of the function involved. There's no reason to think that such an optimization of pthread_mutex_lock() would fail to preserve that function's well-documented memory-order semantics.

If this is true and the compiler can peer into the pthread_mutex_lock() function and sees it does not alter a certain variable, so it optimizes accesses, could this affect correctness,

Can compilers have bugs? Yes, they can and do.

Do current versions of any C compilers have a specific bug along those lines? I don't know.

and if so how can such a core function be left so vulnerable to the possibility of not working correctly?

Nobody is leaving functions open to such failures. To the extent that opportunities for such incorrect optimizations exist, "people" are interested in and motivated to fix their compilers, linkers, and libraries to close those holes.

Understand also that it is common for new optimizations to be tested carefully for an extended period -- sometimes years and multiple compiler versions -- before being designated safe for production use. If ever they are.

Does this mean there has to be some other method employed to tell the compiler not to optimize accesses to these variables, such as the volatile construct?

No. Generally speaking, you should rely on functions and language constructs to behave according to their documentation. Especially functions defined by the C language itself. Almost as much so functions defined by the platform's core specifications, such as POSIX on a POSIX-conforming system.

Also, generally speaking, you should approach compiler options with care and diligence. Some produce non-conforming behavior by design. Some come with caveats. And if I see an option that comes with such lengthy documentation as GCC's -flto does, I usually take it as a sign that it is an expert feature that I shouldn't mess with unless I've invested the time and effort to make myself an expert with it.

John Bollinger
3

In the very old days, yes, multithreading had to rely on things like opaque function boundaries to serve as a makeshift memory barrier. It's true that LTO would break that. Fortunately, this was before LTO was widely implemented.

In modern times, compilers support explicit memory barriers. For gcc in particular, there are several possible approaches:

  • At the most basic level, you can (ab)use gcc's extended asm: asm("" : : : "memory"); will force the compiler to assume that any object in memory could be read or written, so that the compiler cannot reorder load and store instructions around it. This could be used in conjunction with an inline asm barrier instruction, which prevents the CPU itself from reordering the visibility of loads and stores.

  • Later, gcc introduced a series of synchronization builtins. So you could insert __sync_synchronize() into the code, which would likewise prevent compiler reordering and insert barrier instructions as needed. More likely you'd already be using one of the other read-modify-write primitives there, which also insert a full barrier.

  • From C11 onward, the language has a formal memory model that, in essence, specifies exactly which kinds of reordering are allowed. This includes standard functions like atomic_thread_fence that the compiler must handle specially, again producing a barrier of the appropriate kind, so as to inhibit unwanted reordering.

So, using at least one of these methods, the implementation of pthread_mutex_lock would include a memory barrier. Then it can be inlined as much as you like, whether at compile or at link time, and the barrier will still be respected, preventing any reorderings that would violate its semantics.

Nate Eldredge
2

To give your prof a break, what they said has been true for the last 70+ years of compilers (50 years in C's case). Will it remain true? Tough question.

The language C is no longer a lightweight systems programming language. Thanks to the relentless efforts of a series of committees, it now includes precise specifications for unrelated libraries, and behaviour that favours micro-optimising compilers over general systems programming. If one were unkind, one might say the committees are taking out the failures of their preferred languages (Pascal, Modula-2, Euclid, soon-to-be Rust) by undermining C; but that is a side issue.

The compiler can recognize memcpy because it is defined by standard C. pthread_mutex_mumble is not defined by standard C, so it can't. (Yet; the list is growing.)

In general, the compiler cannot look beyond its compilation unit (= source file) for hints about what is going on. This changed a fair bit in 1991, when Plan 9 introduced link-optimising compilers. Of course, the Plan 9 folks were interested in better technology, not in winning micro-benchmarks.

Fast forward 22 years, and it is all benchmarks, baby: mainstream compilers have caught up on link-time optimisation, so now simply hiding behind an external function call is not enough.

Fortunately, pthread_mutex_lock, in most implementations, looks something like:

id = get_my_thread_id();
if (cswap(mutex, 0, id) != 0) {   /* atomically take the lock if free */
    _syscall_mutex_wait(mutex);   /* contended: block in the kernel */
}
return 0;

which is largely opaque to a link-time optimiser (the cswap is an atomic primitive and the wait is a system call), so your prof's initial statement stands.

Finally: since pthread_mutex_lock enforces acquire-release semantics at a minimum, and possibly a full barrier in the case of a system call, it guarantees that the various updates made while holding the lock achieve global visibility before another locker can acquire it. This is extremely important.

blackgreen
mevets
  • What do you mean by 'a full discard in the case of a system call'? And when you say 'incomprehensible to the link time optimizer', you are saying that even with `-flto`, locks would maintain correctness? – user129393192 May 07 '23 at 00:56
  • 2
    Oh, was just going to give this an upvote, but seeing that 9,999 score, perhaps not. – BoP May 07 '23 at 10:44
  • 1
    The system call implements a full barrier, on all? architectures. Partly this is necessity - the programming models surrounding locking (mutex, semaphore, cv, etc... ) do not require atomics to be applied to shared variables; so when one releases a lock it must behave like a write barrier. Similarly, as your process is about to wait for a lock, it cannot be speculatively loading shared variables; thus the operation must behave like a read barrier. It is also very difficult for the processor to maintain such state across protection boundaries. – mevets May 07 '23 at 12:28
  • The lto can only work with what is available -- standard defined functions and machine opcodes. It cannot see that `int 0x21` might lead to a given function within an unknown OS; so must assume that all memory is reachable by that call, thus needs to be updated as the program specified. – mevets May 07 '23 at 12:30
  • *compilers have recently caught up to Link Optimising* Recently?!?! I was using `-xipo` on Sun's compilers to do link-time optimization back in the 1990s. On 64-bit systems even. Next thing you know, these Linux geniuses are going to reinvent Solaris zones - **badly**. And congratulate themselves on how clever they are. Oh, wait, they already did, and called it "Docker". ;-) – Andrew Henle May 07 '23 at 13:07
  • And sorry about the +1 "hit" that pushed your rep over 10K. :-D – Andrew Henle May 07 '23 at 13:11