
TL:DR: if a mutex implementation uses acquire and release operations, could an implementation do compile-time reordering of the kind that would normally be allowed, and overlap two critical sections from different locks that should be independent? This would lead to a potential deadlock.


Assume a mutex is implemented on top of std::atomic_flag:

struct mutex
{
   void lock() 
   {
       while (flag.test_and_set(std::memory_order_acquire)) 
       {
          yield_execution();
       }
   }

   void unlock()
   {
       flag.clear(std::memory_order_release);
   }

   std::atomic_flag flag; // = ATOMIC_FLAG_INIT in pre-C++20
};

So far this looks OK for a single such mutex: the std::memory_order_release store synchronizes with the std::memory_order_acquire load.

The use of std::memory_order_acquire/std::memory_order_release here should not raise questions at first sight; they are similar to the cppreference example https://en.cppreference.com/w/cpp/atomic/atomic_flag

Now there are two mutexes guarding different variables, and two threads accessing them in different order:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();
    m1.unlock();

    m2.lock();
    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();
    m2.unlock();

    m1.lock();
    v1.use();
    m1.unlock();
}

Release operations can be reordered after an unrelated acquire operation (unrelated == a later acquire operation on a different object), so the execution could be transformed as follows:

mutex m1;
data  v1;

mutex m2;
data  v2;

void threadA()
{
    m1.lock();
    v1.use();

    m2.lock();
    m1.unlock();

    v2.use();
    m2.unlock();
}

void threadB()
{
    m2.lock();
    v2.use();

    m1.lock();
    m2.unlock();

    v1.use();
    m1.unlock();
}

So it looks like there is a deadlock.

Questions:

  1. How does the Standard prevent having such mutexes?
  2. What is the best way to implement a spin-lock mutex that doesn't suffer from this issue?
  3. Is the unmodified mutex from the top of this post usable in some cases?

(Not a duplicate of C++11 memory_order_acquire and memory_order_release semantics?, though it is in the same area)

Alex Guteniev
  • "*Release operations can be reordered after unrelated acquire operation*" What is an "unrelated acquire operation"? – Nicol Bolas Apr 19 '20 at 05:03
  • *Release operations can be reordered after unrelated acquire operation* - note that the C++ standard does *not* specify things in those terms. It specifies in terms of what values a thread can see after doing an acquire load that synchronizes-with a release-store (or a release-sequence). But generally compilers making asm for some target ISA can at least in theory do the reordering you suggest at compile time. Certainly for a plain release store, but IDK if real compilers would do it for a release RMW. – Peter Cordes Apr 19 '20 at 05:03
  • @NicolBolas: unrelated = a later acquire operation on a different object. So in terms of a 1-way-barrier model like shown in https://preshing.com/20120913/acquire-and-release-semantics/, they can reorder. – Peter Cordes Apr 19 '20 at 05:03
  • @PeterCordes that's exactly what I meant by unrelated; I've expanded the post with your explanation – Alex Guteniev Apr 19 '20 at 05:14
  • I added a TL:DR at the top so future readers don't have to wade through a long question and read the details of your code before they even know what problem you have in mind. – Peter Cordes Apr 19 '20 at 05:21
  • @PeterCordes, thanks for the TL:DR section. But there's emphasis on _compile-time_ reordering. Could such reordering also happen at runtime? See https://godbolt.org/z/A3Te52 : I see that the lock release is compiled as `mov DWORD PTR std::atomic_flag f, 0`, so I'm also concerned about runtime reordering. – Alex Guteniev Apr 19 '20 at 06:22
  • @AlexGuteniev: If it happens once at runtime, who cares? It won't keep happening and stop the release from ever happening. There's obviously no problem in the pure ISO C++ standard itself, only in implementations that have to nail something down in asm for some ISA. – Peter Cordes Apr 19 '20 at 06:23
  • So, you are saying that compiler barriers after `clear(release)` and before `test_and_set(acquire)` fix the issue? And there's no problem in C++ itself, only with the implementation, so these compiler barriers should be part of `std::atomic_flag`, and not the responsibility of `std::atomic_flag` users? This may be the answer already. – Alex Guteniev Apr 19 '20 at 06:36
  • No, I wasn't saying anything about compiler barriers being necessary. At that point I was still thinking through why this wasn't a real-world problem, and whether it was a QoI issue or a standards-compliance issue. I kept it short because I was part way through writing an answer about the retry loop being the key :P – Peter Cordes Apr 19 '20 at 06:49

2 Answers


There's no problem in the ISO C++ standard; it doesn't distinguish compile-time vs. run-time reordering, and the code still has to execute as if it ran in source order on the C++ abstract machine. So the effects of m2.test_and_set(std::memory_order_acquire) trying to take the 2nd lock can become visible to other threads while still holding the first (i.e. before m1.unlock()), but failure there can't prevent m1 from ever being released.

The only way we'd have a problem is if compile-time reordering nailed down that order into asm for some machine, such that the m2 lock retry loop had to exit before actually releasing m1.

Also, ISO C++ only defines ordering in terms of synchronizes-with and what can see what, not in terms of reordering operations into some new order; that would imply such an order existed. No order that multiple threads can agree on is even guaranteed to exist for operations on separate objects, unless you use seq_cst operations. (A modification order for each object separately is guaranteed to exist, though.)

The 1-way-barrier model of acquire and release operations (like the diagram in https://preshing.com/20120913/acquire-and-release-semantics) is a convenient way to think about things, and matches reality for pure-loads and pure-stores on x86 and AArch64 for example. But as far as language-lawyering, it's not how the ISO C++ standard defines things.


You're reordering a whole retry loop, not just a single acquire

Reordering an atomic operation across a long-running loop is a theoretical problem allowed by the C++ standard. P0062R1: When should compilers optimize atomics? points out that delaying a store until after a long-running loop is technically allowed by the standard's wording of 1.10p28:

An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

But a potentially infinite loop would violate that (it would not be finite in the deadlock case, for example), so compilers must not do it.

It's not "just" a quality-of-implementation issue. A successful mutex lock is an acquire operation, but you should not look at the retry loop as a single acquire operation. Any sane compiler won't.

(The classic example of something that aggressive atomics optimization could break is a progress bar, where the compiler sinks all the relaxed stores out of a loop and then folds all the dead stores into one final store of 100%. See also this Q&A - current compilers don't, and basically treat atomic as volatile atomic until C++ solves the problem of giving programmers a way to let the compiler know when atomics can/can't be optimized safely.)

Peter Cordes
  • So if I implement `timed_mutex` and try to lock for an hour, it looks like, until P0062R1 is addressed, the Standard allows moving such a `timed_mutex::unlock()` past that hour of another thread's `timed_mutex::try_lock_for`. And if I implement this hour-long delay without access to an external clock, with the number of iterations known at compile time (say, I know the minimum duration of the `pause` instruction for some specific CPU), then a compiler might actually do that. Right? – Alex Guteniev Apr 19 '20 at 10:06
  • @AlexGuteniev: IDK, maybe in theory, but then we're into quality-of-implementation territory. A good compiler that people want to use isn't going to reorder an atomic past anything possibly lengthy. And in practice because it's hard to prove stuff, probably not past any loop whose trip-count isn't provably small at compile time. And not past any non-inline function call. Reordering past an `asm volatile` is plausible but I'd hope not, and certainly not in a loop. But if the loop was inside the asm statement, maybe spinning on `rdtsc` for 1 hour could be plausible. – Peter Cordes Apr 19 '20 at 10:26
  • Or maybe 20x `pause` might be something real-world code might do, although with `pause` being ~5 cycles before Skylake but ~100 core clock cycles on SKL, it's maybe unwise to use a fixed number of `pause` iterations. Hopefully [`tpause`](https://www.felixcloutier.com/x86/tpause) (with a TSC deadline) will be supported on mainstream CPUs later, not just Tremont atom. – Peter Cordes Apr 19 '20 at 10:28
  • "_The classic example of something that aggressive atomics optimization could break is a progress bar_" But then aggressive `write` optimization can break the `write(".",1);` console progress bar too. And nobody has ever discussed that optimization because it's insane. Also, the aggressive volatile write optimization can break probably almost any device driver and it's insane too. – curiousguy Jun 11 '20 at 02:15
  • @AlexGuteniev If you really are well versed into semantics, you will notice that nothing in C or C++ is defined. Nothing. No program has defined behavior. It's all QOI. – curiousguy Jun 11 '20 at 02:17
  • (I disagree with curiousguy's paranoia about nothing being well-defined in multi-threaded C++. Certainly some things that many people take for granted are not 100% nailed down by the standard, but some things truly are, and aren't just QOI or implementation details. They've argued this position on numerous SO Q&As without presenting convincing evidence such as an example of a real algorithm that's impossible to write in a way that a standards-compliant implementation couldn't break.) – Peter Cordes Jun 26 '21 at 23:06
  • @PeterCordes Doesn't this part of the spec prohibit the re-ordering of the `m1.unlock(); m2.lock()` pair? (http://eel.is/c++draft/intro.multithread#intro.progress-7): "For a thread of execution providing concurrent forward progress guarantees, the implementation ensures that the thread will eventually make progress for as long as it has not terminated." Basically a non-deadlocking program cannot be optimized into a potentially deadlocking one. – Erik Jun 29 '21 at 19:26
  • @Erik: That's guaranteeing that no thread will be starved indefinitely (in implementations where `std​::​thread` provides *concurrent* forward progress guarantees, which the standard says it *should*, but not must; see the next couple of bullet points.) That's a stronger guarantee than simply not introducing deadlocks, but yes, it would rule out statically reordering to *force* `m2.lock()` to happen before `m1.unlock()`. It doesn't of course forbid the appearance on one run of an un-contended `m2.lock()` appearing to take the lock before other threads see `m1.unlock()`'s release operation. – Peter Cordes Jun 29 '21 at 19:31
  • @Erik: ISO C++ does (or should) prevent introducing deadlocks even on implementations without "concurrent forward progress" guarantees, where some activities by other threads *can* starve a thread indefinitely as long as *some* other threads are making progress. But if those other threads eventually got stuck waiting for a mutex, yes at that point even a DeathStation 9000 would have no choice but to let the `m1.unlock()` make progress in this thread. – Peter Cordes Jun 29 '21 at 19:35
  • @PeterCordes I'm not sure actually that it means what I thought it means. It seems that making an iteration of the test-and-set loop would be considered forward progress. I'm also not really buying your argument based on "An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.". A compiler I would want to use cannot optimize my non-deadlocking code into deadlocking code. I can't find a watertight argument preventing that based on the standard's wording. – Erik Jun 29 '21 at 22:27
  • @PeterCordes I think the standard requires a well formed program to not have any infinite loops. The lock-unlock re-ordering can transform a finite loop into a infinite loop. Hence it's not allowed. – Erik Jun 29 '21 at 22:55
  • @Erik: The C++ standard (unlike C IIRC), requires no infinite loops that don't contain an atomic or volatile access, or an IO operation. (Or something like that; I forget how it's phrased. But that might in theory allow a compliant implementation to implement `std::thread` with cooperative multi-threading where any of those operations are possible context-switch points.) – Peter Cordes Jun 29 '21 at 23:03
  • @Erik: Agreed that nobody would want to use a compiler that could introduce deadlocks (in this or any other way). I'm 100% sure that compiler devs agree with that, and understand that statically doing this transformation at compile-time is thus not allowed. The only question is how strongly we can say that it's formally by the standard, not just quality-of-implementation. IMO it would at least violate the as-if rule, producing a version of the program that didn't work the way the source did, with infinite loops or not being a qualitative difference. – Peter Cordes Jun 29 '21 at 23:10
  • @PeterCordes Hmm, you are right, infinite loop with atomic access is allowed (http://eel.is/c++draft/intro.multithread#intro.progress-1). Seems to me that the standard technically allows the lock-unlock transformation. – Erik Jun 29 '21 at 23:19
  • @Erik: I'm not totally convinced. It's not disallowed *for that reason*, but it seems like a violation of as-if. I *do* find it convincing that infinitely delaying visibility of the unlock would violate at least the *should* ... finite time part I quoted. It doesn't seem like something even a DeathStation 9000 implementation should be allowed to do. I hope there's also some more relevant formalism somewhere. Maybe it would be interesting to ask a new language-lawyer question, or put a bounty on this one, if someone else can find something in the standard. – Peter Cordes Jun 29 '21 at 23:32

You can try CppMem (help page) to test things like that.
CppMem is a tool by the authors of Mathematizing C++ Concurrency.
To prove that some behavior of a program is allowed by the C++ memory model, one has to find an execution of this program for which none of the huge number of memory-model rules is violated.
CppMem can do this automatically.

I simplified your program to this (to reduce the number of executions that CppMem has to check):

// RA mutex deadlock
int main() {
  atomic_int m1 = 0;
  atomic_int m2 = 0;
  
  {{{ { 
        // m1.lock
        atomic_compare_exchange_strong_explicit(m1, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1.unlock
        m1.store(0,memory_order_release);
        
        // m2 is busy
        m2.load(memory_order_acquire).readsvalue(1);
      }
  ||| { 
        // m2.lock
        atomic_compare_exchange_strong_explicit(m2, 0, 1, memory_order_acquire, memory_order_acquire);
        
        // m1 is busy
        m1.load(memory_order_acquire).readsvalue(1);
      }
  }}}
  return 0;
}

CppMem finds an execution where both threads are waiting for mutexes:

[the execution graph]

In this execution both threads simultaneously are trying to acquire a lock held by another thread.
That is similar to a deadlock, but fortunately, this cannot last forever because the standard also guarantees this:

  • 6.9.2.3 [intro.progress]:

    18 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

  • 33.5.4 [atomics.order]:

    11 Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

The quotes mean that the write m1.store(0,memory_order_release) (which is the last write in the modification order for m1) will have to become visible to m1.load(memory_order_acquire) eventually, and then the threads will be able to make progress.

Sudoplatov
  • That's not a deadlock, that's just *a* failure of some finite number of iterations of the lock-taking spin loop. Because as you say, there is a pending store which the standard requires to be visible to some future load. – Peter Cordes May 21 '23 at 22:55
  • Unfortunately, that tool only works with C++11. I wish it had support for the C++20 concurrency features too. – digito_evo May 21 '23 at 23:05
  • @PeterCordes I've changed the wording. – Sudoplatov May 22 '23 at 12:59