So far I have a nice spinlock that works as intended:

    std::atomic_flag barrier = ATOMIC_FLAG_INIT;

    inline void lock() {
        while( barrier.test_and_set( std::memory_order_acquire ) ) {}
    }

However, I want a rough idea of how many CPU cycles are spent inside it (if the busy wait is too long, I'll probably switch to a mutex, which at least puts waiting threads to sleep):

    inline void lock( int & waitCounter ) {
        while( barrier.test_and_set( std::memory_order_acquire ) )
            waitCounter++;
    }

Of course this does not account for the lock instruction itself, so by which constant should I increment waitCounter to get an accurate idea of the cycles spent busy-waiting? (I assume the instructions will not be pipelined because of the memory barrier, so the count should be fairly precise in theory.)

    waitCounter += 2;
    waitCounter += 3;
    waitCounter += 4; // ...
CoffeDeveloper
  • Use [clock](http://www.cplusplus.com/reference/ctime/clock/) or `std::chrono` to get clocks/time and make your decision based on that. – ixSci Aug 19 '15 at 10:55
  • If you want 100% accuracy with the cycle count, on x86/64 you can use `rdtsc`. There is no ARM equivalent available in user-space code, though. – Richard J. Ross III Aug 20 '15 at 20:32
  • You should also add a "pause" instruction to your loop. For x86 this would be `rep nop`. On Windows, `rep nop` (or the equivalent for the current architecture) is accessible through the `YieldProcessor` macro. Other platforms will probably provide the same functionality under a different name. – Paul Groke Aug 20 '15 at 20:40
  • You realize that you've made almost every possible mistake in your spinlock implementation. I would definitely not describe it as "nice". (For example, when you finally do get the lock, the most performance-critical portion, you take the mother of all mispredicted branches as you exit a tight loop.) – David Schwartz Aug 20 '15 at 22:53
  • @DavidSchwartz Advice without an explanation of why it is good is not good advice (also, why use a yield inside the loop, as Paul mentioned?). Apart from the pipeline stall the lock obviously causes, how would you implement it without causing one? (I'll always upvote useful answers, and other users will too, even if they don't directly address the question. At least I'm not that evil ^^) I tried different implementations and they always suffered from race conditions; the spinlock above is the only one that is truly lockless and passes my "Peterson's test suite". – CoffeDeveloper Aug 21 '15 at 11:12
  • @DarioOO I can't really teach a course on writing synchronization primitives in an answer. But if I could, I'd write that code on the board on day one. It's the starting point, not the ending point. You can avoid the pipeline stall on x86 by using opcodes that inhibit branch prediction and speculative execution. (But if you didn't even think about the pipelining issue, you're not qualified to write synchronization primitives, because pipelining is one of the primary factors affecting their performance.) – David Schwartz Aug 21 '15 at 16:37
  • Wow, you're right: just using "likely" and "unlikely" gives some boost on GCC. – CoffeDeveloper Aug 22 '15 at 09:23

2 Answers

The number of cycles a spinlock requires depends on several factors, including the number of threads attempting to acquire it simultaneously.

I did testing on this recently, here.

The short answer: it can vary greatly, due both to things you can control directly (application code) and to things you cannot (bus contention). The spread between the lowest and the highest cycle count can range from 110 to 950 or more.

Olof Forshell
At least on GCC with -O4, it seems the function

inline void lock( int & waitCounter ) {
    while( barrier.test_and_set( std::memory_order_acquire ) )
        waitCounter += 5;
    waitCounter += 2;
}

compiles down to code in which the counting itself is performed by the following instructions:

.L5:
    add DWORD PTR [rdi], 5
.L3:
    mov eax, edx
    xchg    al, BYTE PTR barrier[rip]

    test    al, al
    jne .L5
    add DWORD PTR [rdi], 2
    ret

This is far from a complete answer, but it may give the idea.

CoffeDeveloper