I've developed a monitor object for C++ similar to Java's, with some improvements. The major improvement is that there is a spin loop not only for locking and unlocking but also for waiting on an event. In that case you don't lock the mutex yourself but supply a predicate to a wait_poll function; the code repeatedly tries to lock the mutex by polling, and whenever it acquires the mutex it calls the predicate, which returns (or moves) a pair of a bool and the result type.
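Roughly, the wait_poll interface looks like this (a simplified sketch; the real implementation also does the adaptive spinning and a kernel fallback, and the names here are illustrative):

#include <mutex>
#include <utility>
#include <immintrin.h>

struct monitor
{
    // simplified poll-wait: no adaptive spin count, no kernel fallback
    template<typename Predicate>
    auto wait_poll( Predicate pred )
    {
        for( ; ; )
        {
            if( m_mtx.try_lock() )
            {
                auto result = pred(); // returns std::pair<bool, Result>
                m_mtx.unlock();
                if( result.first )
                    return std::move( result.second );
            }
            _mm_pause(); // spin before the next locking attempt
        }
    }
    std::mutex m_mtx; // stands in for the monitor's internal lock
};

The predicate runs under the lock, so it can safely inspect the monitored state and decide whether the wait is over.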
Waiting on a semaphore or an event object (Win32) in the kernel can easily take 1,000 to 10,000 clock cycles, even when the call returns immediately because the semaphore or event was signaled beforehand. So the spin count needs a reasonable relationship to this kernel wait interval, e.g. spinning for about one tenth of the minimum time that would be spent in the kernel.
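You can see where that figure comes from by timing a wait that returns immediately; here's a small Win32 sketch that sets an auto-reset event before each wait (the numbers will of course vary by CPU and Windows version):

#include <iostream>
#include <chrono>
#include <Windows.h>
using namespace std;
using namespace chrono;

int main()
{
    // auto-reset event, initially not signaled
    HANDLE hEvent = CreateEventA( nullptr, FALSE, FALSE, nullptr );
    static size_t const ROUNDS = 1'000'000;
    auto start = high_resolution_clock::now();
    for( size_t i = ROUNDS; i; --i )
    {
        SetEvent( hEvent );                      // event is set before the wait ...
        WaitForSingleObject( hEvent, INFINITE ); // ... so the call returns immediately
    }
    double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)ROUNDS;
    cout << ns << " ns per signaled kernel wait" << endl;
    CloseHandle( hEvent );
}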
For my monitor object I've taken the spin count recalculation algorithm from glibc, and I'm also using the PAUSE instruction. But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: about 0.78 ns on average. On Intel CPUs it's much more reasonable, at about 30 ns.
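For reference, glibc's recalculation is an exponentially smoothed average of the spins that were actually needed, capped at twice the previous value plus ten. Transcribed into a standalone C++ sketch (a paraphrase of the nptl adaptive-mutex path, not my actual code):

#include <atomic>
#include <algorithm>
#include <thread>
#include <immintrin.h>

std::atomic<bool> g_locked( false );
int g_spins = 0;              // adapted spin count; only updated by the lock holder
int const MAX_SPINS = 1'000;  // upper bound, analogous to glibc's MAX_ADAPTIVE_COUNT

void lock_adaptive()
{
    int cnt = 0;
    int maxCnt = std::min( MAX_SPINS, g_spins * 2 + 10 );
    while( g_locked.exchange( true, std::memory_order_acquire ) )
        if( ++cnt < maxCnt )
            _mm_pause();               // still within budget: spin
        else
            std::this_thread::yield(); // budget exceeded; the real code blocks in the kernel here
    // smooth the spin count towards the number of spins just needed
    g_spins += (cnt - g_spins) / 8;
}

void unlock_adaptive()
{
    g_locked.store( false, std::memory_order_release );
}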
This is the code:
#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>
using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
    auto start = high_resolution_clock::now();
    // execute a billion PAUSEs back to back
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
        _mm_pause();
    // average nanoseconds per PAUSE
    double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
    cout << ns << endl;
}
Why has AMD chosen such a silly PAUSE timing? PAUSE is meant for spin-wait loops, and its duration should closely match the time it takes for a cache line to bounce to a different core and back.
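For comparison, that core-to-core round trip can be measured with two threads bouncing an atomic counter back and forth; a rough sketch (no thread pinning, so expect some noise):

#include <iostream>
#include <chrono>
#include <atomic>
#include <cstdint>
#include <thread>
using namespace std;
using namespace chrono;

int main()
{
    static uint64_t const ROUNDS = 1'000'000;
    atomic<uint64_t> counter( 0 );
    // partner thread: waits for odd values and makes them even again
    thread partner( [&]()
    {
        for( uint64_t i = 0; i < ROUNDS; ++i )
        {
            while( counter.load( memory_order_acquire ) != 2 * i + 1 ); // spin until it's our turn
            counter.store( 2 * i + 2, memory_order_release );
        }
    } );
    auto start = high_resolution_clock::now();
    // main thread: waits for even values and makes them odd
    for( uint64_t i = 0; i < ROUNDS; ++i )
    {
        while( counter.load( memory_order_acquire ) != 2 * i ); // spin until it's our turn
        counter.store( 2 * i + 1, memory_order_release );
    }
    partner.join();
    double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)ROUNDS;
    cout << ns << " ns per cache line round trip" << endl;
}

On typical desktop CPUs such a round trip takes some tens of nanoseconds, so dozens of AMD's 0.78 ns PAUSEs fit into a single flip of the cache line.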