
edit: This is not a duplicate of any question that allows mutex locking in post(). Please read carefully, I need a lock-free post()! Don't mark this as a duplicate if you don't have a real answer.

A semaphore (like the one in Linux) is a useful building block that is not found in the C++ standard, nor (currently) in Boost. I'm mainly talking about semaphores between threads of a single process, over a preemptive scheduler.

I'm specifically interested in them being non-blocking (i.e. lock-free) unless they actually need to block. That is, post() and try_wait() should always be lock-free. And wait() calls should be lock-free if their invocations strongly-happen-after enough post()s have returned. Also, a blocking wait() should be blocked by the scheduler rather than spin-locked. What if I also want a wait_for with a timeout - how much does it complicate the implementation further, while still avoiding starvation?

Are there reasons why semaphores are not in the standard?

Edit3: So, I wasn't aware that there's a proposal to the standard, P0514R4, that deals exactly with these issues and has solutions to all the problems raised here, apart from specifically adding a std::semaphore. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0514r4.pdf

Also, Boost doesn't have these. Specifically, the semaphores in Boost.Interprocess are spin-locked.

What libraries support something like that?

Is it possible to implement it over the Windows API and other widespread systems?

edit: It is not possible to implement this lock-free using atomics+mutex+condition_variable - you either have to block in post() or spin in wait(). If you want a lock-free post(), you cannot lock a mutex in post(). I want to run on a possibly preemptive scheduler, and I don't want post() blocked by other threads that took the mutex and got preempted. So, this is not a duplicate of questions like C++0x has no semaphores? How to synchronize threads?

edit2: Following is an example implementation, just to demonstrate the best that can be done with atomics+mutex+condvar, AFAIK. post() and wait() perform one lock-free compare_exchange, and only if they must do they lock the mutex.

However, post() is not lock-free. And worse, it might be blocked by a wait() that locked the mutex and then got preempted.

For simplicity, I only implemented post_one() and wait_one_for(Duration), instead of post(int) and wait_for(int,Duration). Also, I'm assuming no spurious wakeups, which is not promised by the standard.

#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

#include <boost/scope_exit.hpp>

// shorthand for the std memory orders used below
constexpr auto mo_relaxed = std::memory_order_relaxed;
constexpr auto mo_acquire = std::memory_order_acquire;
constexpr auto mo_acq_rel = std::memory_order_acq_rel;

class semaphore //provides acquire-release memory ordering for the user
{
private:
    using mutex_t = std::mutex;
    using unique_lock_t = std::unique_lock<mutex_t>;
    using condvar_t = std::condition_variable;
    using counter_t = int;

    // positive: posts available; negative: waiters on the slow path
    std::atomic<counter_t> atomic_count_{ 0 };
    mutex_t mutex_;
    condvar_t condvar_;
    counter_t posts_notified_pending_ = 0;
    counter_t posts_unnotified_pending_ = 0;
    counter_t waiters_running_ = 0;
    counter_t waiters_aborted_pending_ = 0;

public:
    void post_one()
    {
        counter_t start_count = atomic_count_.fetch_add(+1, mo_acq_rel);
        if (start_count < 0) { // a waiter is on (or headed for) the slow path
            unique_lock_t lock(mutex_);
            if (0 < waiters_running_) {
                ++posts_notified_pending_;
                condvar_.notify_one();
            }
            else {
                if (0 == waiters_aborted_pending_) {
                    // the waiter hasn't locked the mutex yet; leave the post for it
                    ++posts_unnotified_pending_;
                }
                else {
                    // a timed-out waiter already compensated for this post
                    --waiters_aborted_pending_;
                }
            }
        }
    }

    template< typename Duration >
    bool wait_one_for(Duration timeout)
    {
        counter_t start_count = atomic_count_.fetch_add(-1, mo_acq_rel);
        if (start_count <= 0) { // no post available: slow path
            unique_lock_t a_lock(mutex_);

            ++waiters_running_;
            BOOST_SCOPE_EXIT(this_) { // members must be captured via this_
                --this_->waiters_running_;
            } BOOST_SCOPE_EXIT_END

            if( ( 0 == posts_notified_pending_ ) && ( 0 < posts_unnotified_pending_ ) ) {
                // a post arrived before we registered as a running waiter
                --posts_unnotified_pending_;
                return true;
            }
            else {
                auto wait_result = condvar_.wait_for( a_lock, timeout );
                switch (wait_result) {
                case std::cv_status::no_timeout: {
                    --posts_notified_pending_;
                    return true;
                } break;
                case std::cv_status::timeout: {
                    counter_t abort_count = atomic_count_.fetch_add(+1, mo_acq_rel);
                    if (abort_count >= 0) {
                        /* a post() already increased the negative atomic_count_ and
                           will try to notify; let it know we aborted. */
                        ++waiters_aborted_pending_;
                    }
                    return false;
                } break;
                default: assert(false); return false;
                }
            }
        }
        return true;
    }

    bool try_wait_one()
    {
        counter_t count = atomic_count_.load( mo_acquire );
        while (true) {
            if (count <= 0) {
                return false;
            }
            // on failure compare_exchange_weak reloads 'count', so just retry
            else if (atomic_count_.compare_exchange_weak(count, count-1, mo_acq_rel, mo_relaxed)) {
                return true;
            }
        }
    }
};
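
For example, here's a minimal usage sketch (assuming the class above, with the count starting at zero): one thread posts while another waits with a timeout.

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    semaphore sem;

    std::thread consumer([&] {
        // takes the condvar slow path if the post hasn't happened yet
        if (sem.wait_one_for(std::chrono::seconds(1)))
            std::cout << "got the post\n";
        else
            std::cout << "timed out\n";
    });

    std::thread producer([&] {
        sem.post_one(); // lock-free fast path if no waiter got there first
    });

    producer.join();
    consumer.join();
}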
itaj
  • You're looking for a pure ISO C++11 version of [POSIX `sem_trywait(3p)`](http://man7.org/linux/man-pages/man3/sem_trywait.3p.html)? (Also the [Linux `sem_trywait(3)` man page](http://man7.org/linux/man-pages/man3/sem_wait.3.html)). Have you verified that Linux's trywait is truly lockless without making any system calls in the failure case? (e.g. set up an already-taken semaphore and sleep, and loop on trywait in another thread. `strace -f` to trace system calls.) – Peter Cordes Nov 08 '18 at 07:39
  • @PeterCordes, my question is not Linux specific, although I know the semaphores from Linux. Well, sem_post and sem_trywait use the futex mechanism internally. I don't know the details of the implementation; if it is lock-free, or if it spin-locks only in kernel mode, then it is what I want. If the futex causes a spin-lock in user mode, then I guess I can do just as well with mutex+condition_variable. Part of my question is getting an answer for how this works in Linux, but then also Windows and any widespread system. – itaj Nov 08 '18 at 15:05
  • Should say it more clearly: sem_post/wait/trywait do one compare_exchange in user mode, and only if they have to do they use the futex mechanism internally. I don't know the details of the implementation; if it is lock-free, or if it spin-locks only in kernel mode, then it is what I want. If the futex causes a spin-lock in user mode, that's somewhat less good - but AFAIK I can't even do that with mutex+condvar. – itaj Nov 08 '18 at 15:38
  • There's a nice (mostly) cross-platform, high-performance implementation written by Jeff Preshing that I pilfered for use with the blocking versions of my lock-free queues (and subsequently enhanced slightly). You can find it at the top of this file: https://github.com/cameron314/concurrentqueue/blob/master/blockingconcurrentqueue.h The class you're specifically interested in is called `LightweightSemaphore`, which only blocks at the OS level if it has to (an atomic counter is maintained in user-space for the fast path). – Cameron Nov 08 '18 at 23:13
  • @Cameron, Thanks. That's about the line of thinking that I was looking for. Can you clarify: in modern versions of Linux, the implementation of sem_* already does the user-mode atomic counter compare_exchange; it seems like you don't need to wrap it with your own atomic counter as in LightweightSemaphore. Were you able to measure a difference in performance between Semaphore and LightweightSemaphore on Linux and/or Windows? Another thing: why is there an atomic_signal_fence(mo_acquire) in waitManyWithPartialSpinning - what data is it ordering, and what signal handler is it for? – itaj Nov 09 '18 at 00:16
  • @Cameron oh, I see the comment about the atomic_signal_fence - I didn't know it could/should be used for eliminating such optimization. – itaj Nov 09 '18 at 00:26
  • @itaj I haven't benchmarked it vs other ways of doing things on Linux. I also don't have a very recent kernel, so doing so probably wouldn't be very interesting. The user-mode counter is still useful when incrementing/decrementing many at once, which I don't think the underlying interface supports. Obviously your use case might not need this rather particular feature :-) – Cameron Nov 09 '18 at 20:25

1 Answer


Yes, you can do this as long as your operating system offers a suitable "park" and "unpark" mechanism, which doesn't require taking a lock for unpark. Park refers to allowing a thread to go to sleep (OS blocking) and unpark refers to waking that thread.

You are already close with your atomic counter and condvar approach. The problem is that with a condvar, a mutex is required as part of the semantics. So you have to abandon condvars and go a bit lower level. First, you should pack all the state, such as the current semaphore value and whether there are any waiters (and possibly how many waiters), into a single atomic value, and manipulate it atomically with compare-and-exchange. This prevents the races that would occur if you had these as separate values.

Then you can draw a state diagram showing all the possible states of the semaphore, with edges for all the possible transitions between states (e.g., the "no waiters" state would transition to the "yes waiters" state when a waiter arrives). You implement all the transitions with compare-and-exchange, and whenever it fails you have to re-calculate the transition, since the state may have changed!
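
To make that concrete, here's a minimal sketch of such a packed control word (an illustration only - the 32/32 field split and all names are arbitrary choices): the semaphore count lives in the low half, the waiter count in the high half, and both the wait-side and post-side transitions are single compare-exchanges retried in a loop. The post side here also happens to show the "direct handoff" variant mentioned further down.

#include <atomic>
#include <cstdint>

class packed_state
{
    // low 32 bits: semaphore count; high 32 bits: number of blocked waiters
    std::atomic<std::uint64_t> word_{ 0 };

    static std::uint32_t count(std::uint64_t w)   { return static_cast<std::uint32_t>(w); }
    static std::uint32_t waiters(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }
    static std::uint64_t pack(std::uint32_t num_waiters, std::uint32_t cnt)
    { return (static_cast<std::uint64_t>(num_waiters) << 32) | cnt; }

public:
    // Wait side: returns true if a count was taken; otherwise the caller has
    // atomically registered as a waiter and must now block.
    bool take_or_register_waiter()
    {
        std::uint64_t w = word_.load(std::memory_order_acquire);
        for (;;) {
            std::uint64_t desired = count(w) > 0
                ? pack(waiters(w), count(w) - 1)   // take one count
                : pack(waiters(w) + 1, 0);         // become a waiter
            if (word_.compare_exchange_weak(w, desired,
                    std::memory_order_acq_rel, std::memory_order_acquire))
                return count(w) > 0; // w still holds the pre-transition value
            // CAS failed: w was reloaded; re-derive the transition and retry
        }
    }

    // Post side: returns true if the caller must wake one blocked waiter
    // (direct handoff: the count is never bumped if a waiter consumes it).
    bool release_or_handoff()
    {
        std::uint64_t w = word_.load(std::memory_order_acquire);
        for (;;) {
            std::uint64_t desired = waiters(w) > 0
                ? pack(waiters(w) - 1, count(w))   // hand off to a waiter
                : pack(0, count(w) + 1);           // no waiters: bump the count
            if (word_.compare_exchange_weak(w, desired,
                    std::memory_order_acq_rel, std::memory_order_acquire))
                return waiters(w) > 0;
        }
    }
};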

Then you only need to implement the blocking. On Windows you would use Events - either auto-reset or manual-reset. Both have their advantages and quirks, and there is more than one way to skin this cat. For example, you can probably get it to work with a single shared auto-reset event.

However, here's a sketch of one mechanism which uses a per-thread waiter object in a lock-free queue. The semaphore consists of an atomically manipulated control word, plus a lock-free list of waiter_node elements (or a stack, or whatever off-the-shelf concurrent list-like thing you want to use).

We will assume that each thread owns a waiter_node object which just contains a single manual-reset Event object. This could be created once and stored in TLS (probably the most efficient), or allocated on demand every time a wait needs to occur and de-allocated when the wait is done.

Here's the basic outline (a code sketch follows the two lists):

Wait

  • If the semaphore is available (positive), CAS to decrement it and continue.
  • If the semaphore is not available (zero), the thread calls ResetEvent on its waiter_node's event, then pushes the node on the waiter list, checks that the sem value is still zero, and then calls WaitForSingleObject on its waiter_node's event. When that returns, it starts the wait routine over from the top.

Post

  • Increment the control word. Pop a waiter_node, if any, and call SetEvent on it.
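
Below is a rough sketch of that outline in code (an illustration only: the Treiber-stack waiter_list is just the simplest stand-in for an off-the-shelf lock-free list, and it ignores the ABA problem, error handling and node lifetime for brevity):

#include <windows.h>
#include <atomic>

struct waiter_node {
    HANDLE event;               // manual-reset Event, created once per thread (e.g. in TLS)
    waiter_node* next = nullptr;
};

// Simplest possible lock-free stack of waiter nodes (Treiber stack).
// A real implementation must deal with ABA and node reclamation.
class waiter_list {
    std::atomic<waiter_node*> head_{ nullptr };
public:
    void push(waiter_node* n) {
        n->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }
    waiter_node* try_pop() {
        waiter_node* n = head_.load(std::memory_order_acquire);
        while (n && !head_.compare_exchange_weak(n, n->next,
                   std::memory_order_acq_rel, std::memory_order_acquire)) {}
        return n;
    }
};

class event_semaphore {
    std::atomic<long> count_{ 0 };
    waiter_list waiters_;
public:
    void wait(waiter_node* me) // 'me' is this thread's node
    {
        for (;;) {
            long c = count_.load(std::memory_order_acquire);
            while (c > 0) { // fast path: grab a count, no blocking
                if (count_.compare_exchange_weak(c, c - 1,
                        std::memory_order_acq_rel, std::memory_order_acquire))
                    return;
            }
            // slow path: arm our event, publish ourselves, re-check, then sleep
            ResetEvent(me->event);
            waiters_.push(me);
            if (count_.load(std::memory_order_acquire) > 0)
                continue; // a post slipped in between; retry from the top
            WaitForSingleObject(me->event, INFINITE);
            // a wakeup is a hint, not a handoff: retry from the top
        }
    }

    void post()
    {
        count_.fetch_add(1, std::memory_order_acq_rel);
        if (waiter_node* w = waiters_.try_pop())
            SetEvent(w->event); // benign race: w may not even be asleep yet
    }
};

Note the stale node left behind when wait() retries without sleeping: a later post may SetEvent it spuriously, which the ResetEvent before the next push cleans up - one of the benign races discussed next.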

There are all sorts of "races" here, such as a waiter_node being popped by a post operation before the waiting thread even sleeps on it, but they should be benign.

There are many variants even on this waiter queue based design. For example, you may integrate the list "head" and the control word so they are the same thing. Then the wait doesn't need to double check the semaphore count since the push operation verifies the semaphore state at the same time. You might also implement "direct handoff" where the posting thread doesn't increment the control word at all if there are waiters, but just pops one and wakes it with the information that it has successfully acquired the semaphore.

On Linux, you replace the Event with a futex. It is easier there to implement a "single futex" solution, since futex allows an atomic check-and-block operation inside the kernel that avoids a lot of the races inherent in the Event solution. So a basic sketch is a single control word; you make your transitions atomically with CAS, and then use futex() with FUTEX_WAIT to do a second check of the control word and block atomically (this atomic check-and-sleep is the power of futex).
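
For illustration, a minimal sketch of that single-control-word approach (again not production code: it uses the common Linux shortcut of treating the atomic's storage as the futex word, and it issues a FUTEX_WAKE on every post where real code would track waiters to skip the syscall):

#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

class futex_semaphore {
    std::atomic<std::uint32_t> count_{ 0 };

    long futex(int op, std::uint32_t val) {
        // the atomic's storage doubles as the futex word (common Linux practice)
        return syscall(SYS_futex, reinterpret_cast<std::uint32_t*>(&count_),
                       op, val, nullptr, nullptr, 0);
    }

public:
    void post() // lock-free: one RMW, plus a wake syscall
    {
        count_.fetch_add(1, std::memory_order_release);
        futex(FUTEX_WAKE, 1);
    }

    void wait()
    {
        for (;;) {
            std::uint32_t c = count_.load(std::memory_order_acquire);
            while (c > 0) { // fast path: no syscall at all
                if (count_.compare_exchange_weak(c, c - 1,
                        std::memory_order_acq_rel, std::memory_order_acquire))
                    return;
            }
            // FUTEX_WAIT re-checks count_ == 0 inside the kernel and blocks
            // only if it still is; otherwise it returns immediately and we retry
            futex(FUTEX_WAIT, 0);
        }
    }
};

A wait_for falls out fairly naturally here, since FUTEX_WAIT takes an optional timeout argument.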

BeeOnRope
  • Wow, thanks for the detailed answer. I suppose on Linux the new sem_post/wait functions implemented with futex actually do what you suggest here, right? So I could use them. On Windows I might also use CreateSemaphore directly, as Cameron suggested in a comment. – itaj Nov 10 '18 at 00:04
  • I'm only left with the question of whether SetEvent(), ReleaseSemaphore() and futex_wake() are actually lock-free from the POV of the user thread - as long as they can never make the user thread block waiting on another thread that got preempted while using the object. I presume it makes sense to expect so, but I can't find it clearly in the documentation. I don't mind if they spin-lock in kernel mode against other kernel threads, because those are all active and it would be lock-free from the user thread's POV. – itaj Nov 10 '18 at 00:04
  • @itaj - about `sem_post` and `wait`, I don't know; you'll have to check the code. – BeeOnRope Nov 10 '18 at 02:53
  • I think typical non-RT versions of Linux don't use `CONFIG_PREEMPT`, so in general your process won't get swapped out while inside a system call. So in that case it is lock-free at least in the sense that it won't be swapped out while holding a kernel lock, preventing the progress of other threads, since that seems to be what you are after. Whether it actually grabs at least one lock in that kernel code, though, I'm not sure: you'll have to read the code. I think it is very unlikely that the kernel `futex` code can be interrupted holding a lock, since that would be a big bottleneck. – BeeOnRope Nov 10 '18 at 02:57
  • Similar comments apply to Windows `SetEvent` - although there, of course, you can't read the code! – BeeOnRope Nov 10 '18 at 02:58
  • I went to look at the Linux kernel code. It seems that all spinlocks call preempt_disable/enable internally, so whether or not CONFIG_PREEMPT is defined, the thread cannot be preempted inside a spinlock section. Which ultimately means that sem_post/trywait cannot be blocked by preemption of other threads. Presumably CONFIG_PREEMPT_COUNT must be defined if CONFIG_PREEMPT is. Thanks for your help! – itaj Nov 11 '18 at 00:03