Does C++ header use hardware support when enforcing concurrency or is purely an algorithm based solution

Question

How is C++ mutex implemented under the hood ? Does it only use Decker's, Peterson's or other algorithms to enforce mutual exclusion or does it also use hardware support such as interrupt ad compare-and-swap (CAS).

Is it possible to implement an entire multi-threading library consisting of mutex and condition variable without any hardware support ?

Mutex locking is called expensive as it involves the operating system to take part in this mechanism — Build Succeeded, Mar 11 '20 at 18:49
But is it possible to achieve pure algorithmic mutual exclusion, without getting the operating system involved ? — redenzione11, Mar 11 '20 at 18:56
https://codereview.stackexchange.com/questions/28083/mutex-implementation — Build Succeeded, Mar 11 '20 at 19:08
There are implementations which you can just study to answer that question for those implementations. Add to that the requirements from the standard in order to distinguish what is implementation-specific and what isn't. — Ulrich Eckhardt, Mar 11 '20 at 19:10

Maxim Egorushkin · Answer 1 · 2020-03-11T20:55:02.037

On Linux, when the mutex is uncontended, the lock and unlock happen in user-space only using atomic instructions, such as compare-and-swap. In a contended case a call into the kernel is required to wait on the mutex till it's been unlocked (on lock) or to wake up waiters (on unlock). Locking user-space mutexes does not disable interrupts.

Here is the source code for low-level mutex primitives in glibc, the comments are instructive:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.

   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads.  If the lock is contended
   and a thread decides to block using a futex operation, then this thread
   needs to first change the state to >1; if this state is observed during
   lock release, the releasing thread will wake one of the potentially
   blocked threads.

   Much of this code takes a 'private' parameter.  This may be:
   LLL_PRIVATE: lock only shared within a process
   LLL_SHARED:  lock may be shared across processes.

   Condition variables contain an optimization for broadcasts that requeues
   waiting threads on a lock's futex.  Therefore, there is a special
   variant of the locks (whose name contains "cond") that makes sure to
   always set the lock state to >1 and not just 1.

   Robust locks set the lock to the id of the owner.  This allows detection
   of the case where the owner exits without releasing the lock.  Flags are
   OR'd with the owner id to record additional information about lock state.
   Therefore the states of robust locks are:
    0: not acquired
   id: acquired (by user identified by id & FUTEX_TID_MASK)

   The following flags may be set in the robust lock value:
   FUTEX_WAITERS     - possibly has waiters
   FUTEX_OWNER_DIED  - owning user has exited without releasing the futex.  */

And pthread_mutex_lock code.

Mutex implementation requires multiple threads to agree on a value of a shared variable - the mutex, whether it is locked or unlocked. This is also known as consensus problem and it is not possible to solve it using only atomic loads and stores, a compare-and-swap is required.

score 3 · Accepted Answer · answered Mar 11 '20 at 20:28

3

Standard requires that there are atomic operations. Not necessarily CAS, but at least exchange. std::atomic_flag is required to be true atomic, although CAS is superfluous for it, simple exchange is satisfactory.

Note that any algorithm such as Decker's, Peterson's or other would at least require atomic load and atomic store, and they will not work if load and store are non-atomic. Also non-atomic types do not enforce memory ordering, which is implied by those algorithms.

It is not really required that std::mutex will be based on operating system calls, and that there are operating system calls at all.

In theory, std::mutex could be spinning only mutex.

It could also be blocking only mutex.

In practice a good implementation would first try to handle blocking with atomic, but when it is resorted to wait, it would do OS wait.

answered Mar 11 '20 at 20:28

Alex Guteniev

12,039
2
34
79

Thanks for your answer. So the _turn_ and the _flag_ array used in Peterson's and Decker algorithm require atomic load and store, that is they should be atomic boolean type, is that correct ? I see many implementations of the two algorithms online, where these two variables aren't defined as atomic types. – redenzione11 Mar 11 '20 at 21:41
1

In practice, `bool` or `volatile bool` would be atomic in terms they would be in a single piece (one instruction load / store, that would execute atomically). The problematic part is *memory order*, which would only work for atomics or with explicit memory fences. See here an example how Peterson lock is broken even on x86 if no ordering enforced: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ – Alex Guteniev Mar 11 '20 at 22:11
Generally, synchronization algorithms based only on bool reads and writes are inefficient anyway (CPU cores do lots of unnecessary work), so they are not used in production. As the examples you find are not heavily used in practice, no issues may be discovered with them. – Alex Guteniev Mar 11 '20 at 22:20
Also some examples may be old. Until C++11, there was no official multithreading in C++. Unofficially, Peterson lock implementation could have assumed that memory order problems do not exist, and entire program execution is sequentially consistent. Which was true assumption for old singe core processors – Alex Guteniev Mar 11 '20 at 22:42

score 2 · Answer 3 · answered Mar 11 '20 at 18:43

2

It asks the operating system to do it.

On Linux that'll probably be using pthreads (How does pthread implemented in linux kernel 3.2?).

On Windows it's with the Windows API. You can try asking Microsoft about that one.

The standard library implementation won't be making direct hardware access. This is what operating systems are for.

answered Mar 11 '20 at 18:43

Asteroids With Wings

17,071
2
21
35

You're right that that's how it's typically implemented, but that's not required. And on a system that does not have native threading, of course, the standard library implementation can provide it. – Pete Becker Mar 11 '20 at 18:49
Is this the difference between kernel and user threads ? User threads calling kernel threads to achieve concurrency ? So are all multi-threading libraries in modern languages based on user thread and under the hood calls some kernel thread library function ? – redenzione11 Mar 11 '20 at 18:53
@PeteBecker Right but the point is the OP needs to specify an OS if they need an answer like that. In reality, the odds that this isn't covered by one of two, maybe three common and well-documented implementations, is fairly remote. – Asteroids With Wings Mar 12 '20 at 00:35
inb4 some comment about the embedded world; they usually won't be using `std::thread` but instead whatever the RTOS or similar provides. – Asteroids With Wings Mar 12 '20 at 00:36

Does C++ header use hardware support when enforcing concurrency or is purely an algorithm based solution

3 Answers3