This question asks whether a spin lock can be improved so that it uses less CPU time without compromising latency. Many of the answers suggest high-level constructs from C++11, Boost, and the like.
My first thought was to use a simple C semaphore, since the poster needs to block only when a buffer is empty or full.
In the process of writing up an answer, however, I realized I do not know what the overhead of the semaphore functions is in practice. Intuitively, it seems like it should be small, and it has never been an optimization issue for me, but perhaps it is substantial compared to a spin lock. Presumably it is also system dependent.
Answers to this question suggest that a spin lock is preferred when locking for less than one thread quantum, but no real-world figures are given to explain why.
Answers to this question provide a working example of a semaphore implementation in C++ that uses a spin lock with pthread_wait in the body, but it's not taken from any actual language implementation.
Over here, in a question about speed differences between a mutex and a semaphore, some declare the difference to be insignificant. Others say the semaphore is slower.
An article linked to by this question suggests that the C# lock statement, used as a mutex, costs 50 ns in practice on a 2.4 GHz machine (so ~120 cycles). However, it's unclear whether C#'s implementation is representative of, say, a straight C implementation of the POSIX semaphore.
So, the question is: what is the overhead of semaphore use in practice, and by extension, when should I prefer a spin lock if all I care about is latency (i.e. not maintainability, for some reason)?