7

I was thinking about how to implement counting (not binary) semaphores using as little asm code as possible.
I haven't managed to design and write one without using a mutex, so here's the best I could do so far:

Global:

#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_ullong    value;
    pthread_mutex_t *lock_op;
    bool             ready;
} semaphore_t;

typedef struct
{
    atomic_ullong   value;
    pthread_mutex_t lock_op;
    bool            ready;
} static_semaphore_t;

 /* use with static_semaphore_t */
#define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true}


Functions:

bool semaphore_init(semaphore_t *semaphore, unsigned long long init_value)
{   
    if(semaphore->ready) if(!(semaphore->lock_op = \
                             calloc(1, sizeof(pthread_mutex_t)))) return false;
    else                 pthread_mutex_destroy(semaphore->lock_op);   

    if(pthread_mutex_init(semaphore->lock_op, NULL))
            return false;

    semaphore->value = init_value;
    semaphore->ready = true;
    return true;
}

bool semaphore_wait(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    pthread_mutex_lock(semaphore->lock_op);
    while(!semaphore->value) __asm__ __volatile__("nop");
    (semaphore->value)--;
    pthread_mutex_unlock(semaphore->lock_op);
    return true;
}

bool semaphore_post(semaphore_t *semaphore)
{
    if(!semaphore->ready) return false;

    atomic_fetch_add(&(semaphore->value), (unsigned long long) 1);
    return true;
}


Is it possible to implement a semaphore in only a few lines, using the atomic built-ins or directly in assembly (e.g. lock cmpxchg)?

Looking at the sem_t type from <bits/semaphore.h>, included by <semaphore.h>, it seems to me that a very different path has been chosen...

typedef union
{
    char __size[__SIZEOF_SEM_T];
    long int __align;
} sem_t;




UPDATE:

@PeterCordes has proposed a much better solution: using atomics, without a mutex, doing the checks directly on the semaphore value.

I still want to better understand how to improve the code's performance by taking advantage of built-in pause instructions or kernel calls that avoid wasting CPU while waiting for the critical resource to become available.

It would also be nice to have a standard implementation of mutexes and counting (non-binary) semaphores for comparison.
From futex(7) I read: "The Linux kernel provides futexes ("Fast user-space mutexes") as a building block for fast user-space locking and semaphores. Futexes are very basic and lend themselves well for building higher-level locking abstractions such as mutexes, condition variables, read-write locks, barriers, and semaphores."

JumpAlways
  • I don't see where you include `stdatomics`. Not clear what you want to accomplish. – too honest for this site Mar 18 '16 at 21:02
  • 7
    You certainly can use the atomic operations to do the value compare. But you still need kernel support to suspend and wake up processes. Unless you do a busy loop which is not a feasible general solution. – kaylum Mar 18 '16 at 21:06
  • @Olaf I'd want to find the most efficient way in terms of code size and speed to implement a non-binary semaphore on my own. – JumpAlways Mar 18 '16 at 21:14
  • 1
    @kaylum I think I can't do it with only one compare operation, because n threads can pass the not-null check when `value < nthreads` and make the value go negative, so, as you said, I need to check and decrement atomically, and I used a mutex for this. As regards the busy loop/wait, I thought that was what mutex_lock did. Are you saying there's no more efficient and elegant way to do this? – JumpAlways Mar 18 '16 at 21:23
  • You need an atomic `test&set` instruction. https://en.wikipedia.org/wiki/Test-and-set – Rocki Mar 18 '16 at 21:26
  • @JumpAlways You would use `cmpxchg` not on the lock but on the value itself. "I thought that was what mutex_lock did". Yes it is, but you said you want to avoid using that. So either you need a different call to get kernel wait/wake support or you need to do a busy loop. "Are you saying there's no more efficient and elegant way to do this" I'm not brave enough to say that. Some clever person may have a way. I'm just saying I think you can't do it with just atomic operations. You need kernel wait/wake support. – kaylum Mar 18 '16 at 21:30
  • @Rocki Can you explain the whole plan? – JumpAlways Mar 18 '16 at 21:41
  • You might want to start by learning basic C before you try this. `if (!a = b){/**/}` does not do what you seem to think it does. – EOF Mar 18 '16 at 22:53
  • 3
    @JumpAlways "*As what regards the busy loop/wait I though was what the mutex_lock did,*" Absolutely not! Real world low-level synchronization primitives are written by people with deep platform knowledge. There are a dozen ways to get this wrong that most programmers don't even know exist. See [here](http://stackoverflow.com/a/35331884/721269) for more information on a similar issue. You can produce a toy naively, but don't kid yourself into thinking you're going to get good performance. And don't think you can easily write a benchmark to tell the good from the bad, because that's hard too. – David Schwartz Mar 18 '16 at 23:11
  • @DavidSchwartz Thanks for the link... I really don't want to use a custom semaphore in place of the standard ones, I want to understand how to conceptually implement a non-binary semaphore. However, I'm deeply interested in the art of writing them in the most secure and efficient way. So, if you can provide some material about the standard implementations of mutexes and semaphores and detailed explanations of how low-level synchronization primitives are thought out and written, I'd really appreciate it. :-) – JumpAlways Mar 19 '16 at 01:38
  • @JumpAlways typical 'real' semaphores, (not the joke ones with just user-space loops round atomics), need an atomic count and a queue for the threads waiting on the semaphore. Typical semaphore waits are much longer than typical mutex waits, and so the only reasonable approach, if no units are available from the semaphore, is for execution to be removed from the threads calling wait(). That is why non-crazy semaphore implementations need a kernel entry, either immediately, or after a short spin on multicore systems only. – Martin James Mar 19 '16 at 09:28
  • @JumpAlways And they take into account cache pollution, branch prediction, and conservation of virtual core and inter-core resources. – David Schwartz Mar 19 '16 at 19:06

1 Answer

11

See part way down for my minimal naive semaphore implementation which probably works. It compiles and looks right for x86. I think it's correct in general for any C11 implementation.


IIRC, it's possible to implement a counting lock (aka semaphore) with just a single integer, which you access with atomic operations. The Wikipedia article on semaphores even gives algorithms for up/down. You don't need a separate mutex. If atomic_ullong needs a mutex to support atomic increment/decrement on the target CPU, it will include one. (This may be the case on 32bit x86, or the implementation may use a slow cmpxchg8b instead of a fast lock xadd. Is a 32bit counter really too small for your semaphore? Because 64bit atomics will be slower on 32bit machines.)

The <bits/semaphore.h> union definition is clearly just an opaque POD type with the correct size and alignment, not indicative of the actual implementation.


As @David Schwartz says, it's a fool's errand to implement your own locking for practical use, unless you're an expert. It might be an interesting way to learn about atomic operations and find out what's under the hood in the standard implementations, though. Note carefully his caution that locking implementations are hard to test: you might write code that works for your test-case, on your hardware, with the current version of your compiler and your chosen compile options...


The ready boolean is just a total waste of space. If you can correctly initialize the ready flag so that it's meaningful for functions to look at it, then you can initialize the other fields to a sane initial state.

Your code has a few other problems I noticed:

#define SEMAPHORE_INITIALIZER(value) = {value, PTHREAD_MUTEX_INITIALIZER, true};

static_semaphore_t my_lock = SEMAPHORE_INITIALIZER(1);
// expands to my_lock = = {1, PTHREAD_MUTEX_INITIALIZER, true};;
// you should leave out the = and ; in the macro def so it works like a value
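For instance, a corrected macro (my sketch of the fix being described here, not code from the question) would be:

#define SEMAPHORE_INITIALIZER(value) {value, PTHREAD_MUTEX_INITIALIZER, true}

static_semaphore_t my_lock = SEMAPHORE_INITIALIZER(1);
// now expands to: static_semaphore_t my_lock = {1, PTHREAD_MUTEX_INITIALIZER, true};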

Using a dynamically-allocated pthread_mutex_t *lock_op is just silly. Store the mutex by value, not behind a pointer. Most of your locking functions use the mutex, so the extra level of indirection just slows things down. It would be much better for the memory to just be there along with the counter. A mutex doesn't need a lot of space.
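A sketch of what that could look like (my illustration, combining this with dropping the useless ready flag; not code from the question or the answer):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_ullong   value;     // the semaphore count
    pthread_mutex_t lock_op;   // stored by value: no calloc, no pointer chase
} semaphore_t;

bool semaphore_init(semaphore_t *s, unsigned long long init_value)
{
    if (pthread_mutex_init(&s->lock_op, NULL))
        return false;
    atomic_init(&s->value, init_value);
    return true;
}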


while(!semaphore->value) __asm__ __volatile__("nop");

We want this loop to avoid wasting power and slowing down other threads, including other logical threads sharing the same physical core with hyperthreading.

nop doesn't make a busy-wait loop less CPU-intensive. Even with hyperthreading, it probably makes no difference on x86, because the entire loop body still probably fits in 4 uops, and so issues at one iteration per clock whether there's a nop in there or not. The nop doesn't need an execution unit, so at least it doesn't hurt. This spin-loop happens with a mutex held, which seems silly. So the first waiter will get to this spin loop, while waiters after that will spin on the mutex.
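For comparison (my sketch, not part of this answer): the conventional way to block without busy-waiting, if you're going to use pthreads primitives anyway, is a condition variable, so waiters sleep in the kernel instead of spinning with the mutex held:

#include <pthread.h>

typedef struct {
    unsigned long long value;
    pthread_mutex_t    lock;
    pthread_cond_t     cond;
} cv_sem_t;

void cv_sem_wait(cv_sem_t *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->value == 0)                       // re-check after every wakeup
        pthread_cond_wait(&s->cond, &s->lock);  // atomically releases the lock while sleeping
    s->value--;
    pthread_mutex_unlock(&s->lock);
}

void cv_sem_post(cv_sem_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_mutex_unlock(&s->lock);
    pthread_cond_signal(&s->cond);              // wake one waiter
}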


Here's my naive implementation of a semaphore, using only C11 atomic ops

I think this is a good implementation that achieves its very-limited goals of being correct and small (source code and machine code), and not using other actual locking primitives. There are major areas that I don't even attempt to address (e.g. fairness/starvation, yielding the CPU to other threads, probably other stuff).

See the asm output on godbolt: only 12 x86 insns for down, 2 for up (including the rets). Godbolt's non-x86 compilers (gcc 4.8 for ARM/ARM64/PPC) are too old to support C11 <stdatomic.h>. (They do have C++ std::atomic, though). So I unfortunately can't easily check the asm output on non-x86.

#include <stdatomic.h>
#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)

typedef struct {
  atomic_int val;   // int is plenty big.  Must be signed, so an extra decrement doesn't make 0 wrap to >= 1
} naive_sem_t;

#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
static inline void spinloop_body(void) { _mm_pause(); }  // PAUSE is "rep nop" in asm output
#else
static inline void spinloop_body(void) { }
#endif

void sem_down(naive_sem_t *sem)
{
  while (1) {
    while (likely(atomic_load_explicit(&(sem->val), memory_order_acquire ) < 1))
        spinloop_body();  // wait for the semaphore to be available
    int tmp = atomic_fetch_add_explicit( &(sem->val), -1, memory_order_acq_rel );  // try to take the lock.  Might only need mo_acquire
    if (likely(tmp >= 1))
        break;              // we successfully got the lock
    else                    // undo our attempt; another thread's decrement happened first
        atomic_fetch_add_explicit( &(sem->val), 1, memory_order_release ); // could be "relaxed", but we're still busy-waiting and want other thread to see this ASAP
  }
}
// note the release, not seq_cst.  Use a stronger ordering parameter if you want it to be a full barrier.
void sem_up(naive_sem_t *sem) {
    atomic_fetch_add_explicit(&(sem->val), 1, memory_order_release);
}

The trick here is that it's ok for val to temporarily be too low; that just makes other threads spin. Also note that fetch_add being a single atomic operation is key. It returns the old value, so we can detect when val was taken by another thread between the while loop's load and the fetch_add. (Note that we don't need to check that tmp is == to the while loop's load: it's fine if another thread upped the semaphore between the load and the fetch_add. This is a benefit of using fetch_add instead of cmpxchg).
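To make that concrete, here's what a compare-exchange based down could look like (my sketch, not from this answer; it reuses naive_sem_t and spinloop_body from above). The CAS fails whenever val changed at all, even from 2 to 3, so it retries in cases where fetch_add would have succeeded:

void sem_down_cas(naive_sem_t *sem)
{
  while (1) {
    int old = atomic_load_explicit(&(sem->val), memory_order_acquire);
    if (old < 1) {                 // nothing available yet: keep spinning
        spinloop_body();
        continue;
    }
    // try to replace old with old-1; fails if val is no longer exactly equal to old
    if (atomic_compare_exchange_weak_explicit(&(sem->val), &old, old - 1,
                                              memory_order_acq_rel,
                                              memory_order_relaxed))
        break;                     // we took one unit of the semaphore
  }
}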

The atomic_load spin loop is just a performance optimization over having all the waiters doing atomic read-modify-writes on val. (Although with many waiters trying to dec and then undo with inc, having a waiter ever see the lock unlocked could be very rare).
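As a quick way to exercise it (my own usage sketch, assuming sem_down/sem_up and naive_sem_t above are in the same file; not part of the original answer):

#include <pthread.h>
#include <stdio.h>

static naive_sem_t sem;                 // initialized in main with one slot
static long shared_counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_down(&sem);                 // take the single slot
        shared_counter++;               // critical section
        sem_up(&sem);                   // give it back
    }
    return NULL;
}

int main(void)
{
    atomic_init(&sem.val, 1);           // value 1 makes it behave like a lock
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld\n", shared_counter);    // should print 200000
    return 0;
}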

A real implementation would have special stuff for more platforms than just x86. For x86, probably more than just a PAUSE instruction inside the spinloop. This is still just a toy example of a fully-portable C11 implementation. PAUSE apparently helps avoid mis-speculation on memory ordering so the CPU runs more efficiently after leaving the spin loop. pause is not the same as yielding the logical CPU to the OS for a different thread to run. It also has nothing to do with correctness and choice of memory_order_??? parameters.

A real implementation would probably give up the CPU to the OS after some number of iterations of spinning (sched_yield(2), or more likely a futex system call, see below). Maybe use x86 MONITOR / MWAIT to be even more hyperthreading-friendly; I'm not sure. I haven't ever implemented locking myself for real, I just see all this stuff in the x86 insn reference while looking up other insns.
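For example, a spin-then-yield wait loop might look like this (my assumption of a plausible shape, not something from the original answer; a real implementation would futex-wait instead of just yielding):

#include <sched.h>

static void wait_until_available(naive_sem_t *sem)
{
    int spins = 0;
    while (atomic_load_explicit(&(sem->val), memory_order_acquire) < 1) {
        if (++spins < 1000)       // arbitrary tuning constant, purely illustrative
            spinloop_body();      // PAUSE on x86
        else
            sched_yield();        // give the CPU back to the OS scheduler
    }
}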


As mentioned previously, x86's lock xadd instruction implements fetch_add (with sequential-consistency semantics, since locked instructions are always a full memory barrier). On non-x86, using only acquire+release semantics for the fetch_add, rather than full sequential consistency, might allow more efficient code. I'm not sure, but using just acquire would quite likely allow more efficient code on ARM64. I think we only need acquire on the fetch_add, not acq_rel, but I'm not sure. On x86 there won't be any difference in code, since locked instructions are the only way to do atomic read-modify-write, so even relaxed will be the same as seq_cst (except for compile-time reordering).


If you want to yield the CPU instead of spinning, you need a system call (as people have said). Obviously a lot of work has gone into making the standard library locking as efficient as possible on Linux. There are dedicated system calls for helping the kernel wake the right thread(s) when a lock is released, and they are not simple to use. From futex(7):

NOTES
To reiterate, bare futexes are not intended as an easy-to-use abstraction for end-users. (There is no wrapper function for this system call in glibc.) Implementors are expected to be assembly literate and to have read the sources of the futex user-space library referenced below.
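Just to show the raw building block the man page is talking about (a heavily simplified sketch of the two futex operations; my assumption, not from this answer, and nowhere near a correct semaphore on its own):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

// Sleep only if *addr still holds expected_val; the kernel checks this atomically,
// which closes the race between "I saw it unavailable" and "I went to sleep".
static void futex_wait(atomic_int *addr, int expected_val)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, expected_val, NULL, NULL, 0);
}

// Wake at most one thread sleeping on addr (e.g. called from up after the increment).
static void futex_wake_one(atomic_int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}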


fairness / starvation (which my naive implementation ignores)

As the wikipedia article mentions, some kind of wakeup queue is a good idea, so the same thread doesn't keep getting the semaphore every time. (Code that takes a lock quickly after releasing it would usually have the releasing thread get the lock while other threads are still asleep).

This is another major benefit to kernel cooperation in the process (futex).

Peter Cordes
  • From the link you provided: _"If the implementation does not ensure atomicity of the increment, decrement and comparison operations, then there is a risk of increments or decrements being forgotten, or of the semaphore value becoming negative. Atomicity may be achieved by using a machine instruction that is able to read, modify and write the semaphore in a single operation. In the absence of such a hardware instruction, an atomic operation may be synthesized through the use of a software mutual exclusion algorithm."_ – JumpAlways Mar 19 '16 at 02:29
  • 1
    _"On uniprocessor systems, atomic operations can be ensured by temporarily suspending preemption or disabling hardware interrupts. This approach does not work on multiprocessor systems where it is possible for two programs sharing a semaphore to run on different processors at the same time. To solve this problem in a multiprocessor system a locking variable can be used to control access to the semaphore. The locking variable is manipulated using a test-and-set-lock command."_ – JumpAlways Mar 19 '16 at 02:30
  • You need more than an atomic subtraction, because more threads than the semaphore value can pass the not-null test if it's before the dec operation and then all decrement the sem value, making it negative, while if you put the test after the decrement the sem value can become negative (< -1), because only the decrement operation is atomic and you can't know how many dec ops the pool can execute before switching to one of the threads' test ops. So, I think you can't do it without mutexes; even if you want to use a queue you have to test the value before calling the handler, and that's not atomic. – JumpAlways Mar 19 '16 at 02:36
  • I had understood it, and I was arguing about the atomicity of the test op. You wrote a nice tricky example; in fact, I was saying you can't be sure with only an atomic up/write and test, or vice versa. – JumpAlways Mar 19 '16 at 02:58
  • @JumpAlways: actually, I just noticed a bug in the implementation: The retry loop and the spin loop have to be different. I'm not sure exactly what my previous version did, but it wasn't what I meant to write. – Peter Cordes Mar 19 '16 at 03:04
  • Can you explain to me, in an easier way, why your code is less CPU-intensive? As @DavidSchwartz said, the standard implementations are carefully designed to be efficient, so what are the issues in using a loop of the standard `mutex_lock()` or `trylock()`? – JumpAlways Mar 19 '16 at 03:05
  • Yeah, I noticed the fact you were doing the ops if value < 1, but I got the point of rechecking whether a thread decremented the sem when it shouldn't have. – JumpAlways Mar 19 '16 at 03:08
  • @JumpAlways: less CPU-intensive than yours? It takes fewer instructions and the lock data type is smaller. It also doesn't use any library locking primitives, which is kinda cheating when trying to figure out how to implement locking yourself just in terms of atomic ops. IDK if there would be any measurable performance differences. Mine is just a simple naive busy-waiting implementation of semaphores. With a library semaphore you'd get a wait loop that at least used the x86 `PAUSE` instruction to save power and be more hyperthreading-friendly. Usually also yield the CPU to the OS. – Peter Cordes Mar 19 '16 at 03:10
  • So, this can be definitely an answer to my Q: _"Is it possible to implement a semaphore using only few lines, with the atomic built-ins or directly in assembly"_ In fact, you're claiming the mutex task can be replaced by a direct check on the sem value. – JumpAlways Mar 19 '16 at 03:12
  • 2
    I have to say I don't want to use my own implementation, but I want to understand this topic deeply, so the challenge was to find out how to implement a semaphore on my own as efficiently as possible, with as little code as possible, and maybe to be able to compare the differences with an actual standard implementation. – JumpAlways Mar 19 '16 at 03:17
  • @JumpAlways: yes, a naive implementation can be much more compact than yours, with simple loops that only compile to a few machine instructions. (As long as you're on a platform where `atomic_ullong` fits in a register. Seriously, when do you need more than 2^32 concurrent lock holders, esp. when you don't provide `up`/`down` by more than 1 at once? Using a 64bit type here is going to lead to nasty asm on 32bit platforms, with essentially no benefit on 64bit platforms.) – Peter Cordes Mar 19 '16 at 03:17
  • I think in your code it would be nice to pass from 1 to 0 and execute the critical region, so tmp should be checked to be greater than or equal to 0. – JumpAlways Mar 19 '16 at 03:38
  • 1
    As regards the performance, I still don't understand why using a fast op in an infinite loop would require less CPU. I mean, pausing the thread and waiting on some kind of interrupt is one thing, but looping over some code, whatever it is, seems the same to me, unless you add a smart sleep. – JumpAlways Mar 19 '16 at 03:45
  • @JumpAlways: the wikipedia implementation suggests that the normal semantics are `val` = number of resources available. So `0` means no slots available, i.e. locked. `1` or greater means there are resources available. That's how my code works, and seems to be what you're saying. I'm pretty sure the correct check is `tmp>=1`, because `fetch_add` returns the *old* value. So you check to make sure there was at least one resource available at the time you took one. Also, see my last edit for code that actually compiles and uses signed int to allow the count to actually be negative: major bug – Peter Cordes Mar 19 '16 at 03:48
  • Yeah, sure, I confused `tmp` with the current value... The performance issue remains. My `up` function is a short one too, 2 lines, depending on the machine, excluding the ready check. But your claims about the loop still don't convince me: what is the point of using a faster check if the cycles saved will be used again in the loop, because the thread still doesn't get the lock? – JumpAlways Mar 19 '16 at 03:53
  • @JumpAlways: *why using a fast op in an inf loop would require less CPU*? A busy-waiting CPU can have an impact on other CPUs, esp. if it keeps writing to the semaphore. If 10 threads are waiting on the semaphore, they will delay unlocking if they all have pending writes. The cache coherency protocol will have to arbitrate which core gets a turn to write to the semaphore. Also, with a lot of atomic decs and incs contending with each other, it might be rare to ever get to a situation where all the wrong-`dec`s were undone by `inc`s at once, so a thread would actually see the lock available. – Peter Cordes Mar 19 '16 at 03:55
  • Other than using a read-only loop to spin on `val` being available, making that loop more efficient helps with hyperthreading. The `PAUSE` "Spin Loop Hint" instruction exists to let the CPU know that you're running a spin loop, so it should give the other logical core priority. (And apparently helps avoid a mis-speculation on memory ordering, making the loop-exit more efficient when the lock becomes available.) Besides that, smaller code-size is a win (for I-cache / uop-cache / paging in code from disk reasons). You're right that spinning *faster* doesn't help, but more efficiently does. – Peter Cordes Mar 19 '16 at 04:01
  • I don't know the `mutex_lock()` implementation, so I can't know whether it needs to write to the semaphore. I was saying that in general, I think an op like `lock cmpxchg`, for example, for a mutex can't be worse than a faster op, since they both wait until the resource is marked as available. – JumpAlways Mar 19 '16 at 04:02
  • Ok, I've now read your last comment... So your code produces this `PAUSE` instruction, since the loop is lighter, while mine is more CPU-intensive, right? – JumpAlways Mar 19 '16 at 04:06
  • Ah, now I see what you're asking. `lock xadd` isn't in the spin loop. The spin loop is just an `atomic_load`. There's a retry loop around the whole thing, but in the un-contended case it's just a load / `lock xadd` / break. A more efficient operation speeds up the un-contended case. Speeding up the un-contended case also speeds up the retry loop, but that's not the useful part. re: your last: no, my code doesn't use `_mm_pause()`, but a real spin-loop should. Follow the godbolt link to see actual compiler output. – Peter Cordes Mar 19 '16 at 04:06
  • My other thought about `lock cmpxchg` being worse than `lock xadd` is that `cmpxchg` will fail if we thought `val` was 2 but find it's `3` instead. `fetch_add` only has to retry if the old `val` was actually less than 1, even if it's different from when we first checked to break out of the spin loop. So using compare-exchange leads to retries in cases where fetch_add doesn't need to. (implementing `fetch_add` in terms of `cmpxchg8b` requires a retry loop just for the compare-exchange part, so you then have nested retry loops). Just look at the code with `atomic_llong` and `-m32`! – Peter Cordes Mar 19 '16 at 04:12
  • _"A more efficient operation speed up the un-contended case."_ Yeah, I got it but it's not useful in general, I think in case of a weight operation between the semaphore check and a low maxval for the semaphore it would be useful to slowdown the loop, while as you said, in case of fast ops in the critical region the loop shoul be as faster as it can. As regars `_mm_pause()`, I've read that not all machines support it and in these cases it produces a `nop`, so I used a nop in the example. – JumpAlways Mar 19 '16 at 04:15
  • Yeah, you're right, I've used `cmpxchg` to make a point on the consequences of the loop speed. – JumpAlways Mar 19 '16 at 04:19
  • @JumpAlways: How is a 64bit counter useful in a semaphore? It's completely ridiculous to need to support more than 2 billion threads inside the critical section at the same time. `MAX_INT` isn't "low". When you say "useful", I think you mean "worth it" (which would be true if MAX_INT was tiny). And re: `pause`: since machines that don't support it can run it as a `nop`, you should have written `pause`. The reverse isn't true: machines that support `_mm_pause` don't treat all `nop`s as `pause`. – Peter Cordes Mar 19 '16 at 04:26
  • It's just an example, a 4-byte int is enough... I thought `_mm_pause()` was translated into `nop` at preprocessor time where not supported. – JumpAlways Mar 19 '16 at 04:27
  • At this point I agree with you, and I think a lighter check with a `pause` is the best solution. – JumpAlways Mar 19 '16 at 04:29
  • @JumpAlways: It always translates into `F3 90`, which is a NOP with a prefix byte. That still doesn't explain why you'd put a `nop` in your code, since it will never be a `pause` even on machines that do support it. Since it was just for example reasons anyway, `asm volatile ("pause")` would have got the intent across. – Peter Cordes Mar 19 '16 at 04:29
  • I'm not an expert, so one last point: are the memory orderings you used safe on every platform, or should anyone who wants to use this custom implementation use macros to pick the best ones for every situation? – JumpAlways Mar 19 '16 at 04:33
  • Also, make sure you're not getting confused between the retry loop (the `while(1)` around the whole thing) and the spin loop (`while(load(val) < 1) ;`). The spin loop doesn't need to be *fast*, because as you noted earlier, waiting faster doesn't help. It does help if it doesn't slow down other threads (even with hyperthreading). The retry loop does need to be fast, because it always matters. Anyway, `pause` is nowhere near enough to make my naive implementation comparable with one that interacts with the kernel for long waits. – Peter Cordes Mar 19 '16 at 04:33
  • @JumpAlways: re: memory ordering: I tried to use the weakest `memory_order_xxx` that would still let the semaphore protect a critical section according to C11 memory ordering rules, not just on x86. I think my `acq_rel` increment could safely be [just an `acquire`](http://preshing.com/20120913/acquire-and-release-semantics/). This will let the compiler use whatever is necessary on the target platform, but hopefully nothing beyond what's necessary. – Peter Cordes Mar 19 '16 at 04:37
  • Ok, thanks a lot! Maybe we have to put your example code with the `pause` in another answer with a `?` on the memory ordering and see votes and opinions only for that code, since this is about the 30th comment and since my Q was: _"Is it possible to implement a semaphore using only few lines, with the atomic built-ins or directly in assembly"_ and I think **your code is the new starting point to continue the conceptual discussion on how to safely implement a non-binary semaphore as efficiently as possible, using as little code as possible.** :-) – JumpAlways Mar 19 '16 at 04:46
  • @JumpAlways: I just updated my answer with `_mm_pause`, and some more comments. I think it's now in good shape as an answer to the question of whether we can implement semaphores with only a few lines of C / instructions in asm. `pause` has nothing to do with the correctness of the locking or choice of `memory_order_something`, though. I'm not sure what you think is still worth asking about. It helps the CPU avoid mis-speculation, which is only a performance effect. The only thing I'm not sure about in my code is whether `mo_acquire` would be enough in the `fetch_add`. – Peter Cordes Mar 19 '16 at 04:58