
I just wanted to test whether my compiler recognizes ...

atomic<pair<uintptr_t, uintptr_t>>

... and uses DWCASes on it (like x86-64 lock cmpxchg16b), or whether it supplements the pair with an ordinary lock.

So I first wrote a minimal program with a single noinline function which does a compare-and-swap on an atomic pair. The compiler generated a lot of code for this which I didn't understand, and I didn't see any LOCK-prefixed instructions in it. Curious whether the implementation places a lock inside the atomic, I printed the sizeof of the above atomic pair: 24 on a 64-bit platform, so obviously without a lock.

Finally I wrote a program in which all the threads my system has (Ryzen Threadripper, 64 cores, Win10, SMT off) increment both halves of a single atomic pair a predefined number of times. Then I calculated the time per increment in nanoseconds. The time is rather high, about 20,000 ns per successful increment, so at first it looked to me as if there were a lock I had overlooked; but that couldn't be true with a sizeof of 24 bytes for this atomic. And when I looked at the process viewer I saw that all 64 cores were at nearly 100% user CPU time the whole time, so there can't have been any kernel interventions.

So is there anyone here smarter than me who can identify what this DWCAS substitute does from the assembly dump?

Here's my test-code:

#include <iostream>
#include <atomic>
#include <utility>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <vector>

using namespace std;
using namespace chrono;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

int main()
{
    cout << "sizeof(atomic<pair<uintptr_t, uintptr_t>>): " << sizeof(atomic_pair) << endl;
    atomic_pair ap( uip_pair( 0, 0 ) );
    cout << "atomic<pair<uintptr_t, uintptr_t>>::is_lock_free: " << ap.is_lock_free() << endl;
    mutex mtx;
    unsigned nThreads = thread::hardware_concurrency();
    unsigned ready = nThreads;
    condition_variable cvReady;
    bool run = false;
    condition_variable cvRun;
    atomic_int64_t sumDur = 0;
    // each thread: wait for the common start signal, then perform n
    // successful CAS increments on the shared atomic pair
    auto theThread = [&]( size_t n )
    {
        unique_lock<mutex> lock( mtx );
        if( !--ready )
            cvReady.notify_one();
        cvRun.wait( lock, [&]() -> bool { return run; } );
        lock.unlock();
        auto start = high_resolution_clock::now();
        uip_pair cmp = ap.load( memory_order_relaxed );
        // retry the weak CAS until it succeeds; on failure cmp is
        // refreshed with the value currently stored in ap
        for( ; n--; )
            while( !ap.compare_exchange_weak( cmp, uip_pair( cmp.first + 1, cmp.second + 1 ), memory_order_relaxed, memory_order_relaxed ) );
        sumDur.fetch_add( duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count(), memory_order_relaxed );
        lock.lock();
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static size_t const ROUNDS = 100'000;
    for( unsigned t = nThreads; t--; )
        threads.emplace_back( theThread, ROUNDS );
    unique_lock<mutex> lock( mtx );
    cvReady.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( jthread &thr : threads )
        thr.join();
    cout << (double)sumDur / ((double)nThreads * ROUNDS) << endl;
    uip_pair p = ap.load( memory_order_relaxed );
    // both halves must be equal if every update was truly atomic
    cout << "synch: " << (p.first == p.second ? "yes" : "no") << endl;
}

[EDIT]: I've extracted the compare_exchange_weak call into a noinline function and disassembled the code:

#include <atomic>
#include <cstdint>

using namespace std;

struct uip_pair
{
    uip_pair() = default;
    uip_pair( uintptr_t first, uintptr_t second ) :
        first( first ),
        second( second )
    {
    }
    uintptr_t first, second;
};

using atomic_pair = atomic<uip_pair>;

#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
    #define NOINLINE __attribute__((noinline))
#endif

NOINLINE
bool cmpXchg( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    return ap.compare_exchange_weak( cmp, xchg, memory_order_relaxed, memory_order_relaxed );
}

    mov    eax, 1
    mov    r10, rcx
    mov    r9d, eax
    xchg   DWORD PTR [rcx], eax
    test   eax, eax
    je     SHORT label8
label1:
    mov    eax, DWORD PTR [rcx]
    test   eax, eax
    je     SHORT label7
label2:
    mov    eax, r9d
    test   r9d, r9d
    je     SHORT label5
label4:
    pause
    sub    eax, 1
    jne    SHORT label4
    cmp    r9d, 64
    jl     SHORT label5
    lea    r9d, QWORD PTR [rax+64]
    jmp    SHORT label6
label5:
    add    r9d, r9d
label6:
    mov    eax, DWORD PTR [rcx]
    test   eax, eax
    jne    SHORT label2
label7:
    mov    eax, 1
    xchg   DWORD PTR [rcx], eax
    test   eax, eax
    jne    SHORT label1
label8:
    mov    rax, QWORD PTR [rcx+8]
    sub    rax, QWORD PTR [rdx]
    jne    SHORT label9
    mov    rax, QWORD PTR [rcx+16]
    sub    rax, QWORD PTR [rdx+8]
label9:
    test   rax, rax
    sete   al
    test   al, al
    je     SHORT label10
    movups xmm0, XMMWORD PTR [r8]
    movups XMMWORD PTR [rcx+8], xmm0
    xor    ecx, ecx
    xchg   DWORD PTR [r10], ecx
    ret
label10:
    movups xmm0, XMMWORD PTR [rcx+8]
    xor    ecx, ecx
    movups XMMWORD PTR [rdx], xmm0
    xchg   DWORD PTR [r10], ecx
    ret

Maybe someone understands the disassembly. Remember that XCHG is implicitly LOCK'ed on x86. It seems to me that MSVC uses some kind of software transactional memory here. I can make the structure embedded in the atomic arbitrarily large, and the size difference stays 8 bytes; so MSVC always seems to use some kind of STM.
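
To make the discussion easier, here's my attempt to transcribe the assembly back into C++-like code (just my reading of the dump; the layout and all names are guessed from the addressing, so this is not MSVC's source):

#include <intrin.h>

// transcription of the disassembly above; DWORD PTR [rcx] looks like a
// flag, the payload sits at [rcx+8] and [rcx+16]
bool cmpXchg_transcribed( atomic_pair &ap, uip_pair &cmp, uip_pair xchg )
{
    long &flag = *reinterpret_cast<long *>( &ap );                     // [rcx]
    uip_pair &data = *reinterpret_cast<uip_pair *>( (char *)&ap + 8 ); // [rcx+8]
    // label1..label7: take the flag with an (implicitly LOCKed) XCHG,
    // spinning with PAUSE and a growing back-off while it reads non-zero
    int backoff = 1;
    while( _InterlockedExchange( &flag, 1 ) != 0 )
        while( __iso_volatile_load32( reinterpret_cast<volatile int *>( &flag ) ) != 0 )
        {
            for( int i = backoff; i != 0; --i )
                _mm_pause();
            backoff = backoff < 64 ? backoff * 2 : 64;
        }
    // label8/label9: plain 16-byte compare of the payload against cmp
    bool equal = data.first == cmp.first && data.second == cmp.second;
    if( equal )
        data = xchg; // success: MOVUPS-store the new value
    else
        cmp = data;  // failure: MOVUPS-load the current value into cmp
    _InterlockedExchange( &flag, 0 ); // the final XCHG releases the flag
    return equal;
}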

  • You might want to look at the result of [std::atomic::is_lock_free](https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free) – Richard Critten Sep 19 '21 at 16:46
  • Which compiler are you using? Both gcc and clang fail for me with 'std::atomic requires a trivially copyable type'. – unddoch Sep 19 '21 at 16:46
  • Same as @unddoch - failed to compile - live - https://godbolt.org/z/bEnhqhfhv – Richard Critten Sep 19 '21 at 16:51
  • As @RichardCritten pointed out, use the `is_lock_free` function to determine if it is using a lock under the hood. That being said, if I were you I'd test the performance difference between a sequentially consistent memory ordering (default for `std::atomic`) and an `std::mutex`. I would expect, in low contention situations, that the `std::mutex` would perform better than sequentially consistent operations on an `std::atomic`. – WBuck Sep 19 '21 at 17:01
  • I changed the code so that I use my own pair. I compiled it with g++ 12 and interestingly the compiler complains about missing __atomic_load_16 and __atomic_compare_exchange_16 functions. So at least gcc12 recognises my pair and uses a DCAS for this. But how do I fix the linker errors? Supplying -mcx16 doesn't work. – Bonita Montero Sep 19 '21 at 17:04
  • @BonitaMontero what standard version are you targeting? – WBuck Sep 19 '21 at 17:06
  • @WBuck -std=c++20 -mcx16 with gcc 11 (not 12 as I said above). – Bonita Montero Sep 19 '21 at 17:12
  • Those functions are in libatomic. So you have to link with `-latomic`, and look inside its source to see how it's implemented. – Nate Eldredge Sep 19 '21 at 17:13
  • Ok, now it works! On my older Linux computer, a Phenom X4 945 3GHz, I get about 55 clock cycles with one thread and 727 clock cycles with four threads. So the compiler really uses a DCAS!!! – Bonita Montero Sep 19 '21 at 17:21
  • If I single-step into `__atomic_compare_exchange_16`, I do indeed encounter a `lock cmpxchg16b`. I think it's wrapped in a library function so that it can fall back at runtime to another implementation if `cmpxchg16b` isn't available. – Nate Eldredge Sep 19 '21 at 17:22
  • Yes, but try it with MSVC - there's a lot of code without any lock prefixes and without any kernel calls! And for the Linux issue: the code on Linux still says that my atomic isn't lock-free although it uses a single locked instruction according to the timings and what you've been disassembling. – Bonita Montero Sep 19 '21 at 17:26
  • I believe the issue is that `is_lock_free` is determined at compile time, and only returns true if the type is *guaranteed* lock-free. Since it might fall back to a non-lock-free implementation at runtime if you run your binary on some other machine, the compiler can't make that promise. – Nate Eldredge Sep 19 '21 at 17:34
  • As best I can tell from [this output](https://godbolt.org/z/ahhGPsz5P), MSVC is putting a spinlock mutex around it. I'm not able to actually test it very easily. – Nate Eldredge Sep 19 '21 at 17:39
  • @NateEldredge: Of course is_lock_free is evaluated at runtime - because there might be different behaviour depending on misalignment or whatever. That's why there is is_always_lock_free, which is static. – Bonita Montero Sep 19 '21 at 17:59
  • Oh, good point. It looks like this is a gcc libatomic decision. https://github.com/gcc-mirror/gcc/blob/f75b237254f32d5be32c9d9610983b777abea633/libatomic/glfree.c#L59 says "Users likely expect 'lock-free' to also mean 'fast', which is why we do not return true if, for example, we implement loads with this size/alignment using a CAS." Which is exactly the case for 16 bytes on x86-64 - if you step through `__atomic_load_16`, it does `lock cmpxchg16b` as well. – Nate Eldredge Sep 19 '21 at 18:05
  • At a quick glance at the code, I think MSVC *is* putting a simple spinlock as a member of the `atomic<uip_pair>`. That's why it's 24 bytes instead of 16, as you'd expect if the compiler were going to handle it with atomic DCAS. And you can see it in the disassembly: `rcx` points to the `atomic<uip_pair>` to be handled, `DWORD PTR [rcx]` is the lock, and the data itself is at `QWORD PTR [rcx+8]` and `QWORD PTR [rcx+16]`. Calling this "software transactional memory" seems too fancy a description. – Nate Eldredge Sep 22 '21 at 14:28
  • By the way, are you compiling with optimization? There are some bits of the code that don't look optimized at all, e.g. `sete al` followed by `test al, al`. – Nate Eldredge Sep 22 '21 at 14:29
  • @NateEldredge: With a "simple spinlock" the code wouldn't be so long. And spinlocks are unfeasible in userland because the code could be scheduled away in the middle of its work and other code might wait indefinitely. – Bonita Montero Sep 23 '21 at 02:41
  • @BonitaMontero: I agree the asm is a little overcomplicated (which is why I asked about optimization), and it appears to have something like a linear backoff to not hammer on the cache line too frequently if the lock cannot be taken right away. But it definitely is a spinlock, I'm quite confident of that. You're absolutely right about the risk of a spinlock in case the thread holding the lock is scheduled out - that appears to be a risk that MSVC has decided to take, given that the lock should otherwise only be held for a few cycles. – Nate Eldredge Sep 23 '21 at 02:49
  • We can walk through the disassembly if you'd like, but also think about what's not there - there's clearly no call to the operating system or any library function, so no way the code can yield its timeslice while waiting. There's no `cmpxchg16b`, and short of locking, there isn't any way to atomically emulate a DCAS with smaller atomic CAS (else CPU designers would not bother to provide DCAS). And it's clear that some sort of looping is taking place, with the backward jumps. – Nate Eldredge Sep 23 '21 at 02:57
  • (It's exponential backoff actually, with a cap of 64 `pause` iterations between lock attempts. I misread `add r9d, r9d` before.) – Nate Eldredge Sep 23 '21 at 05:40
  • @BonitaMontero: Hardly any ISAs have DCAS (https://en.wikipedia.org/wiki/Double_compare-and-swap). Some m68k had it. What you're looking for is D**W**CAS, double-*word*-CAS, i.e. a CAS on a contiguous memory region two pointer-widths wide. x86-64 *can* do that with `lock cmpxchg16b` (except in the earliest AMD64 K8 CPUs, which omitted that instruction). I edited your question to use that terminology. – Peter Cordes Nov 18 '21 at 00:54

1 Answer


As Nate pointed out in the comments, it is a spinlock.

You can look up the source; it ships with the compiler and is available on GitHub.

If you build an unoptimized debug binary, you can step into this source during interactive debugging!

There's a member variable called _Spinlock.
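
Judging from the disassembly in the question (the flag at offset 0, the payload at offset 8), the implied storage layout is roughly this (a sketch, not the literal STL declaration):

// layout implied by the disassembly, not the literal STL source
struct atomic_uip_pair_layout {
    long     _Spinlock; // DWORD PTR [rcx]: 0 = free, 1 = held
                        // (+ 4 bytes padding to align the payload)
    uip_pair _Value;    // QWORD PTR [rcx+8] and QWORD PTR [rcx+16]
};                      // sizeof == 24, matching the question's output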

And here's the locking function:

#if 1 // TRANSITION, ABI, GH-1151
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff   = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC)
    while (_InterlockedExchange(&_Spinlock, 1) != 0) { // TRANSITION, GH-1133: _InterlockedExchange_acq
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            __yield();
        }
    }
#else // ^^^ defined(_M_ARM) || defined(_M_ARM64) || defined(_M_ARM64EC) ^^^
#error Unsupported hardware
#endif
}

(disclosure: I brought this increasing back-off from the Intel manual into there; it was just an xchg loop before: issue, PR)
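
The matching release is just an interlocked store of zero; that's the final `xchg DWORD PTR [r10], ecx` (with `ecx = 0`) in the question's disassembly. Simplified to the x86 path, it looks roughly like this:

inline void _Atomic_lock_release(long& _Spinlock) noexcept {
    _InterlockedExchange(&_Spinlock, 0); // the XCHG with 0 seen in the dump
}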

Spinlock use is known to be suboptimal; a mutex that does a kernel wait should have been used instead. The problem with a spinlock is that in the rare case where a context switch happens while a low-priority thread holds the spinlock, it will take a while until that spinlock is unlocked, as the scheduler is not aware of a high-priority thread waiting on it.

Sure, not using cmpxchg16b is also suboptimal. Still, for bigger atomics a non-lock-free mechanism has to be used anyway. (There's no deliberate decision to avoid cmpxchg16b; it is just a consequence of keeping the ABI compatible all the way back to Visual Studio 2015.)
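
For comparison, if you want an actual lock-free 16-byte CAS on MSVC today, you can call the cmpxchg16b intrinsic yourself. A sketch (the wrapper name and struct are mine; it needs a 16-byte-aligned destination and a CPU that supports cx16):

#include <intrin.h>
#include <cstdint>

struct alignas(16) u64_pair { std::uint64_t first, second; };

// returns true on success; on failure 'expected' receives the current value
bool dwcas( u64_pair volatile &dest, u64_pair &expected, u64_pair desired )
{
    return _InterlockedCompareExchange128(
        reinterpret_cast<__int64 volatile *>( &dest ),
        static_cast<__int64>( desired.second ), // high 8 bytes ([dest+8])
        static_cast<__int64>( desired.first ),  // low 8 bytes  ([dest+0])
        reinterpret_cast<__int64 *>( &expected ) ) != 0;
}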

There's an issue about making this better, which will hopefully be addressed with the next ABI break: https://github.com/microsoft/STL/issues/1151

As for transactional memory: it might make sense to use hardware transactional memory there. I can speculate that Intel RTM could possibly be implemented there with intrinsics, or there may be some future OS API for it (like an enhanced SRWLOCK), but it is likely that nobody will want more complexity there, as a non-lock-free atomic is a compatibility facility, not something you would deliberately want to use.
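
To illustrate, here is a purely speculative sketch of such an RTM fast path for the 16-byte load, eliding the spinlock and falling back to it when the transaction aborts (the RTM intrinsics live in <immintrin.h>; nothing like this exists in the STL):

#include <immintrin.h>
#include <intrin.h>

// speculative sketch, not in the STL: elide the spinlock with an RTM
// transaction; any concurrent lock holder makes the transaction abort
uip_pair rtm_load( long &_Spinlock, const uip_pair &_Value )
{
    if( _xbegin() == _XBEGIN_STARTED )
    {
        if( __iso_volatile_load32( reinterpret_cast<volatile int *>( &_Spinlock ) ) != 0 )
            _xabort( 0 );         // lock is held: abort, take the slow path
        uip_pair result = _Value; // both words are read in one transaction
        _xend();
        return result;
    }
    // fallback: the ordinary spinlock path shown above
    _Atomic_lock_acquire( _Spinlock );
    uip_pair result = _Value;
    _Atomic_lock_release( _Spinlock );
    return result;
}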

  • RTM could be used as a drop-in replacement for `lock cmpxchg16b`, likely making pure-load more efficient. (Especially with multiple concurrent readers, because it would truly be read only, not needing to get the line into Exclusive or Modified state.) It might not be any faster for pure-store, except for avoiding cx16 retry loops on contended stores. And of course replacing cx16 retry loops for .fetch_add and whatever if you had a 128-bit integer type that supported any of the other methods. – Peter Cordes Nov 18 '21 at 07:37
  • RTM would be compatible with other threads using `lock cmpxchg16b`, so it wouldn't be an ABI break; that's the beauty of transactional memory: it's just making an atomic transaction out of multiple instructions, instead of the limits of one like `lock xadd`. (Of course a transaction that touches multiple memory locations is a fundamentally new capability, like actual DCAS, not just DWCAS for a contiguous pair of pointers.) – Peter Cordes Nov 18 '21 at 07:41
  • @PeterCordes, the unavailability of an immediately accessible CPU with RTM enabled prevents me from attempting to introduce RTM anywhere; otherwise I might have tried already – Alex Guteniev Nov 18 '21 at 07:46
  • I haven't actually played with it myself, I just understand the concept of how it works on a CPU-architecture level. (Or at least I think I do :P based on stuff like https://www.realworldtech.com/haswell-tm/ and understanding how `lock cmpxchg16b` works internally). Practical problems in using it include efficiently using the right code on CPUs that have it. And with Intel being like Lucy with a football to our Charlie Brown, it keeps getting disabled on more and more CPUs that once supported it. – Peter Cordes Nov 18 '21 at 07:57