How do atomics larger than the CPU's native support work

Question

With current C++ compilers, std::atomic works with types that are larger than what the CPU can handle atomically in hardware. On x64 the hardware supports atomics of up to 16 bytes, but std::atomic also works with larger structs. Look at this code:

#include <iostream>
#include <atomic>

using namespace std;

struct S { size_t a, b, c; };

atomic<S> apss;

int main()
{
    auto ref = apss.load( memory_order_relaxed );
    apss.compare_exchange_weak( ref, { 123, 456, 789 } );
    cout << sizeof ::apss << endl;
}

The cout above always prints 32 on my platform. But how do these transactions actually work without a mutex? I can't find any clue by inspecting the disassembly.

If I run the following code with MSVC++:

#include <atomic>
#include <thread>
#include <array>

using namespace std;

struct S { size_t a, b, c, d, e; };

atomic<S> apss;

int main()
{
    array<jthread, 2> threads;
    auto threadFn = []()
    {
        auto ref = apss.load( memory_order_relaxed );
        for( size_t i = 10'000'000; i--; apss.compare_exchange_weak( ref, { } ) );
    };
    threads[0] = jthread( threadFn );
    threads[1] = jthread( threadFn );
}

Almost no kernel time is consumed by the code, so the contention is handled entirely in user space. I guess some kind of software transactional memory is at work here.

The atomic may internally use a locking mechanism ... call its is_lock_free() to find out whether it is done "without a mutex". — Öö Tiib, Aug 14 '23 at 12:27
[std::atomic::is_lock_free](https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free) — Quimby, Aug 14 '23 at 12:27
The above code compiles, but doesn't link with GCC 13.1.0 (MinGW built by Brecht Sanders) for me. — Fureeish, Aug 14 '23 at 12:29
@EdisonvonMyosotis Have you checked the standard library source code, it may well do the locking (it may still use compiler intrinsics which will forward to the OS) — Pepijn Kramer, Aug 14 '23 at 12:30
@PepijnKramer The only external library call for the first two lines in main() above is to memcmp() with MSVC++. I think the code uses something like software transactional memory, but I don't know how this actually works. — Edison von Myosotis, Aug 14 '23 at 12:34
@EdisonvonMyosotis weird, that actually worked, but it didn't require me to manually link against `atomic` for "simpler" (e.g., `atomic`) use-cases. And I have no idea why that was the case — Fureeish, Aug 14 '23 at 12:41
`std::atomic` does not imply that `T` is atomic on the hardware level. The point of `std::atomic` is that you need not know if `T` is atomic on the hardware level. Actually, even if `bool` is atomic for the hardware it is not for C++, but you need to use `std::atomic` — 463035818_is_not_an_ai, Aug 14 '23 at 12:41
@Fureeish https://stackoverflow.com/questions/76854480/latomic-flag-sometimes-not-required/76854542#76854542 — 463035818_is_not_an_ai, Aug 14 '23 at 12:42
FWIW there is no threading going on here, so the compiler is within its rights to optimize all the atomic code away. Essentially your program can be optimized to `cout << sizeof ::apss << endl;` — NathanOliver, Aug 14 '23 at 12:42
What output do you get with `cout << (apss.is_lock_free() ? "LOCKFREE" : "MUTEX") << "\n";`? — Eljay, Aug 14 '23 at 12:52
@Eljay: The above atomic claims not to be lock-free, but if I constantly do compare_exchange_weak() from two threads I get two loaded cores without any kernel time consumption. So the whole thing is happening in userspace and there must be some kind of software transactional memory here. — Edison von Myosotis, Aug 14 '23 at 12:58
@NathanOliver The first two lines actually aren't optimized away. — Edison von Myosotis, Aug 14 '23 at 13:00
@EdisonvonMyosotis Why would a mutex require kernel memory consumption? — Yakk - Adam Nevraumont, Aug 14 '23 at 13:20
`lock` doesn't imply a mutex, in some implementations it's implemented with a spin lock — Alan Birtles, Aug 14 '23 at 13:29
Relevant question: [Where is the lock for a std::atomic?](https://stackoverflow.com/q/50298358/580083) — Daniel Langr, Aug 14 '23 at 13:58
AFAIK there isn't any transactional memory mechanism that could feasibly be used here. Intel's TSX exists but isn't widely available; it was disabled by microcode updates on older CPUs due to security bugs, and is not being implemented on newer CPUs. I think you are going to find that a lock of some kind is being used. — Nate Eldredge, Aug 14 '23 at 14:26
I remember a question some time ago where we worked through a disassembly of MSVC's non-lock-free atomics and found that they added a spinlock as an extra hidden member of the struct. That would be consistent with your observation that both cores run 100% and no kernel resources are used. I can't find it now, unfortunately. — Nate Eldredge, Aug 14 '23 at 14:29
@NateEldredge This can be easily checked by using `sizeof`. A hidden member needs to occupy some storage. libstdc++ and libc++ seem to use another solution (hash table of locks indexed by the pointer to an atomic object), as written in the post I linked above. — Daniel Langr, Aug 14 '23 at 14:56
@Yakk-AdamNevraumont If there's no contention, a mutex is locked completely in userspace; if there's contention, the kernel participates in locking. — Edison von Myosotis, Aug 15 '23 at 09:18
@AlanBirtles Mutexes with partial spinning are common, but pure spinlocks don't make sense in user space since a thread holding a spinlock could be scheduled away, thereby keeping contenders spinning. — Edison von Myosotis, Aug 15 '23 at 09:19
@NateEldredge Transactional memory is also possible in userspace without hardware support. That's called software transactional memory. STM is much less efficient than hardware transactional memory and because of that not used very often. — Edison von Myosotis, Aug 15 '23 at 09:20
@NateEldredge As I described spinlocks don't make sense in userspace. — Edison von Myosotis, Aug 15 '23 at 09:21
@EdisonvonMyosotis: You are absolutely right about the problem with spinlocks, but nevertheless that is what that previous disassembly showed. I too thought it was a strange design. I wish I could find it. I'll search some more. — Nate Eldredge, Aug 15 '23 at 15:16
@EdisonvonMyosotis: Aha, I found it: https://stackoverflow.com/questions/69245183/dwcas-alternative-with-no-help-of-the-kernel/70015983#70015983. It was for `atomic>`. Interestingly the OP there also initially guessed that transactional memory was involved. — Nate Eldredge, Aug 15 '23 at 15:26
@NateEldredge Software transactional memory and hardware transactional memory are very different to program. — Edison von Myosotis, Aug 16 '23 at 17:21

doron · Answer 1 · 2023-08-14T14:28:55.003

3

If there is no machine code primitive to perform the action without a lock, std::atomic will add the required lock to ensure things are atomic.

There is even a compile-time `is_always_lock_free` member that can be used to test this.
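As a minimal sketch of that check (the struct names `Small` and `Large` are made up for illustration and are not from the question), the compile-time member can be read without creating any object. The run-time `is_lock_free()` call is left out on purpose because, as the comments under the question note, with GCC it can require linking against libatomic.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical payload types, chosen only to show the two outcomes.
struct Small { std::uint32_t a; };        // 4 bytes: natively atomic on mainstream targets
struct Large { std::size_t a, b, c; };    // 24 bytes on x64: no single atomic instruction covers it

// is_always_lock_free (C++17) is a compile-time constant: true only if
// every object of that atomic type is guaranteed to be lock-free.
constexpr bool small_always_lock_free = std::atomic<Small>::is_always_lock_free;
constexpr bool large_always_lock_free = std::atomic<Large>::is_always_lock_free;
```

Code that must not take a lock, for example in a signal handler, could guard itself with `static_assert(std::atomic<T>::is_always_lock_free)` so that a locking fallback is rejected at compile time.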

This is really important in contexts where mutexes cannot be used, such as signal handlers.

Edit: Worth adding that a good locking mechanism will use atomics in user-space and only defer to the kernel if there is contention. The futex on Linux is one such mechanism. This is used for mutexes on Linux.
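To illustrate the "user space first, kernel only under contention" idea, here is a rough sketch of a spin-then-block lock. It is not how futex-based mutexes are actually implemented: the class name `SpinThenBlockLock`, the spin limit of 100, and the use of `std::condition_variable` as a stand-in for the kernel wait are all assumptions made for this example.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock: tries a bounded number of atomic exchanges in user
// space, and only sleeps (the stand-in for a kernel wait) if that fails.
class SpinThenBlockLock {
public:
    void lock() {
        // Fast path: uncontended or briefly contended, no blocking call.
        for (int attempt = 0; attempt < kSpinLimit; ++attempt) {
            if (!locked_.exchange(true)) return;   // previous value false: we own the lock
        }
        // Slow path: register as a waiter and sleep until unlock() notifies.
        std::unique_lock<std::mutex> guard(sleep_mutex_);
        ++waiters_;
        sleep_cv_.wait(guard, [this] { return !locked_.exchange(true); });
        --waiters_;
    }

    void unlock() {
        // Default (sequentially consistent) ordering keeps this store ordered
        // before the waiter check below, so a sleeping waiter is not missed.
        locked_.store(false);
        if (waiters_.load() > 0) {
            std::lock_guard<std::mutex> guard(sleep_mutex_);
            sleep_cv_.notify_one();
        }
    }

private:
    static constexpr int kSpinLimit = 100;   // arbitrary value for the sketch
    std::atomic<bool> locked_{false};
    std::atomic<int> waiters_{0};
    std::mutex sleep_mutex_;
    std::condition_variable sleep_cv_;
};

// Small demo: several threads increment a plain counter under the lock.
inline long contended_count(int threads, int iterations) {
    SpinThenBlockLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

When the lock is free, `lock()` and `unlock()` are each one atomic operation plus a load, which is why an uncontended mutex of this kind shows up as pure user time.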

edited Aug 14 '23 at 14:28

answered Aug 14 '23 at 14:18

doron


Is there any way to distinguish between situations where an implementation is aware of (and follows) a target platform's convention for locking, thus allowing interop with code outside the implementation, versus those where the implementation is unaware of such a convention and thus has to implement its own locking mechanism which would thus be unsuitable for interop with outside code? – supercat Aug 14 '23 at 20:53
Check the MSVC machine code - there's no locking for the above code. I guess the code uses software transactional memory. – Edison von Myosotis Aug 15 '23 at 03:19
@EdisonvonMyosotis: I would love to check the MSVC machine code, but I don't have MSVC readily available, nor do I know what version or compiler options you used. Would you please post it for us? – Nate Eldredge Aug 15 '23 at 15:32
@NateEldredge I use Visual Studio 2022 with the latest updates. – Edison von Myosotis Aug 16 '23 at 17:21
The non-lock-free fallback may just be a simple spinlock, or may use the same locking code as std::mutex (which yes on Linux will use `futex` if the lock is unavailable after some retries). Depends on the C++ standard library, or on the compiler's internal implementation of GNU C builtins like `__atomic_load_n`. See [Where is the lock for a std::atomic?](https://stackoverflow.com/q/50298358) – Peter Cordes Aug 17 '23 at 17:17
@EdisonvonMyosotis: https://godbolt.org/z/n417vbx6W shows MSVC 19.35 inlining a spinlock loop for `apss.load()`. Note the `xchg DWORD PTR std::atomic apss, eax` and the branching involving a `pause` in the spin-wait loop. – Peter Cordes Aug 17 '23 at 17:33
Not 100% sure of the implementation but I think a Windows CriticalSection will operate all userside if there is no contention. – doron Aug 18 '23 at 06:59

Alex Guteniev · Answer 2 · 2023-08-17T18:31:47.010

2

TL;DR: it is a userspace spinlock, a bad design decision that is locked in for now for ABI reasons.

MSVC uses a spinlock for atomic but a SRWLOCK for atomic_ref

See the source:

    // Spinlock integer for non-lock-free atomic. <xthreads.h> mutex pointer for non-lock-free atomic_ref
    mutable typename _Atomic_storage_types<_Ty>::_Spinlock _Spinlock{};

Spinlocks are currently considered bad practice, specifically because a spinlock does not yield to the kernel and can cause a long busy-wait after an unfortunate context switch.

This is acknowledged by the MSVC STL maintainers, but for ABI compatibility reasons it cannot be fixed right now. A couple of years ago a PR was accepted that at least adds a pause instruction to the busy-wait loop, which makes the situation a bit better, but there is still no kernel wait.
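To make the design concrete, here is a simplified sketch of an "atomic" with a spinlock stored inside the object. It is not the MSVC STL code: `LockedAtomic`, `Triple` and the member names are invented for this example, and the byte-wise comparison with `memcmp` is only assumed to resemble what the real library does (the question mentions a `memcmp()` call in the generated code).

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical wrapper: every operation takes a per-object spinlock.
template <class T>
class LockedAtomic {
    static_assert(std::is_trivially_copyable<T>::value,
                  "the payload must be trivially copyable, as for std::atomic");

public:
    T load() const {
        acquire();
        T copy = value_;
        release();
        return copy;
    }

    void store(const T& desired) {
        acquire();
        value_ = desired;
        release();
    }

    // Compares the stored bytes with `expected`; on a match writes `desired`,
    // otherwise copies the current value into `expected`.
    bool compare_exchange(T& expected, const T& desired) {
        acquire();
        const bool equal = std::memcmp(&value_, &expected, sizeof(T)) == 0;
        if (equal) value_ = desired;
        else       expected = value_;
        release();
        return equal;
    }

private:
    void acquire() const {
        // Pure busy-wait in user space. A real implementation would put a
        // CPU pause hint in the loop body; nothing here ever sleeps.
        while (spin_.exchange(true, std::memory_order_acquire)) {
        }
    }
    void release() const { spin_.store(false, std::memory_order_release); }

    T value_{};
    mutable std::atomic<bool> spin_{false};   // the hidden lock member
};

// Same shape as the struct in the question: three size_t members.
struct Triple { std::size_t a, b, c; };
```

The extra member plus alignment padding would explain why the size of the atomic exceeds the size of the payload, and a busy-wait of this kind is consistent with two fully loaded cores and almost no kernel time.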

atomic_ref, added in C++20, could start from scratch, so it uses SRWLOCK, which goes to the kernel after some unspecified amount of unsuccessful spinning.

With the next ABI-breaking version, std::atomic is expected to use SRWLOCK too.

By the way, in MSVC each non-lock-free atomic has its own dedicated spinlock as a member, and it is likely to have its own SRWLOCK in the future. (Another possibility is a hash table of such objects, which is effectively the only possibility for atomic_ref.)
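The hash-table alternative can be sketched roughly as follows. Everything in the `sketch` namespace is hypothetical (table size, hash function, function names); it only illustrates the idea of picking a lock from the object's address, which is why it also works for `atomic_ref`, where no extra member can be added to the referenced object.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

namespace sketch {

constexpr std::size_t kLockCount = 64;   // arbitrary table size for the example

// One process-wide table of locks shared by all "atomic" objects.
inline std::array<std::mutex, kLockCount>& lock_table() {
    static std::array<std::mutex, kLockCount> table;
    return table;
}

// Map an address to a slot: drop the low bits, then fold into the table.
inline std::size_t lock_index(const void* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return static_cast<std::size_t>((addr >> 4) % kLockCount);
}

inline std::mutex& lock_for(const void* p) { return lock_table()[lock_index(p)]; }

template <class T>
T locked_load(const T* obj) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    return *obj;
}

template <class T>
void locked_store(T* obj, const T& desired) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    *obj = desired;
}

template <class T>
bool locked_compare_exchange(T* obj, T& expected, const T& desired) {
    std::lock_guard<std::mutex> guard(lock_for(obj));
    if (std::memcmp(obj, &expected, sizeof(T)) == 0) {
        *obj = desired;
        return true;
    }
    expected = *obj;   // report the value that was actually found
    return false;
}

struct Triple { std::size_t a, b, c; };   // plain object, no hidden lock member

}  // namespace sketch
```

The trade-off is that unrelated objects can hash to the same slot and contend with each other, and the lock lives in per-process memory rather than next to the data.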

No, MSVC does not use transactional memory yet, neither for atomics nor for anything else, except that some intrinsics are available. Using it for atomics does look like a good idea to me, though.

To be clear, I mean hardware transactional memory (at least Intel RTM; I doubt that MSVC ever supported the AMD thing), and only in an ABI-breaking version. I don't know much about software transactional memory.

edited Aug 17 '23 at 18:31

answered Aug 17 '23 at 13:26

Alex Guteniev


Other implementations, such as GCC and Clang (at least targeting non-Windows) do use a hash table of locks, keyed on the address of the atomic object. [Where is the lock for a std::atomic?](https://stackoverflow.com/q/50298358) . So they're not address-free, and won't work across processes in shared memory the way MSVC's will (?) with the lock inside the atomic object. Interesting, https://godbolt.org/z/ef5ndEsxo shows MSVC inlining the locking code for `.load()` on the OP's struct. – Peter Cordes Aug 17 '23 at 17:26
Note the OP said **software** transactional memory. That would be more expensive than just using a lock per object, since it allows different combinations of things to be read and written as atomic transactions. And it's not ABI-compatible with **hardware** transactional memory (like Intel TSX / RTM). With the HLE part of TSX disabled in microcode on current CPUs, we can't have nice things. (hardware lock elision made spinlocks work as transactions without actually contending over the spinlock's cache line.) – Peter Cordes Aug 17 '23 at 17:29
@PeterCordes, yes, I meant hardware transactional memory, specifically RTM, and with an ABI break. I would not rely on MSVC non-`is_lock_free` `std::atomic` being address-free, as the ABI-breaking version is likely to use an `SRWLOCK`, not something custom with RTM or without it, and `SRWLOCK` isn't address-free (it is like a futex-based nonrecursive lightweight shared mutex). – Alex Guteniev Aug 17 '23 at 18:41
Right yes, good point that it's not future-proof to rely on MSVC's `std::atomic` fallback locks being address-free. The ISO C++ standard recommends (with "should" phrasing IIRC) that `is_lock_free` atomics should be address-free, which is the case on all implementations I'm aware of, so software wanting to do shared memory across processes should be checking for `is_always_lock_free` for both portability (to non-Windows) and future-proofing. And besides, locking/unlocking every access sucks; it takes more code but a totally different fallback path using your own locking could be much better. – Peter Cordes Aug 17 '23 at 19:02
