Can/should non-lock-free atomics be implemented with a SeqLock?

Question

In both MSVC STL and LLVM libc++ implementations std::atomic for non-atomic size is implemented using a spin lock.

libc++ (Github):

  _LIBCPP_INLINE_VISIBILITY void __lock() const volatile {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
  }
  _LIBCPP_INLINE_VISIBILITY void __lock() const {
    while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
        /*spin*/;
  }

MSVC (Github) (recently discussed in this Q&A):

inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
    // Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
    // Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
    // The code in mentioned manual is covered by the 0BSD license.
    int _Current_backoff   = 1;
    const int _Max_backoff = 64;
    while (_InterlockedExchange(&_Spinlock, 1) != 0) {
        while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
            for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
                _mm_pause();
            }
            _Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
        }
    }
#elif
/* ... */
#endif
}

While thinking of a better possible implementation, I wonder if it is feasible to replace this with SeqLock? Advantage would be cheap reads if reads don't contend with writes.

Another thing I'm questioning is if SeqLock can be improved to use OS wait. It appears to me that if reader observes an odd count, it can wait with atomic wait underlying mechanism (Linux futex/Windows WaitOnAddress), thus avoiding the starvation problem of spinlock.

To me it looks like possible. Though C++ memory model doesn't cover Seqlock currently, types in std::atomic must be trivially copyable, so memcpy reads/writes in seqlock will work and will deal with races if sufficient barriers are used to get a volatile-equivalent without defeating optimizations too badly. This will be part of a specific C++ implementation's header files so it doesn't have to be portable.

Existing SO Q&As about implement a SeqLock in C++ (perhaps using other std::atomic operations)

Peter Cordes · Accepted Answer · 2022-02-20T22:49:45.597

Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.

It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)

Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.

How to use the sequence number as a spinlock

A writer or RMWer attempts to atomically CAS the sequence number to increment (if it wasn't already odd). If the sequence number is already odd, writers just spin until they see an even value.

This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (MESI Share request, then RFO). On a machine that actually had a fetch_or in hardware, you could use that to atomically make the count odd and see if you won the race to take it from even to odd.

On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, no other writer is going to be modifying it). So you can do a release-store of that plus 1 to "unlock" instead of needing a lock add.

Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it, otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too.

An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but still to a store even if it was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all if it will be useless, and hope that the hardware is designed to make the best of things.)

(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)

It may not be a good idea by default because of bad results with heavy write activity making it potentially hard for a reader to get a successful read done. As Wikipedia points out:

The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).

The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic<T> for very large T; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)

But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.

It would be nice if there was a way to make this an option for an implementation, like an optional template parameter such as std::atomic<T, true>, or a #pragma, or #define before including <atomic>. Or a command-line options.

An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock<T>. An optional template param would still make std::atomic types be that class name, so e.g. matching specializations of other things for std::atomic. But might break other things, IDK.

Might be fun to hack something up to experiment with.

For a `SeqLock` does the lock counter need to be on the same cache line as the data being read? Otherwise I don't see why you couldn't get stale data if the write changes to the payload propegate before the counter. — Noah, Nov 25 '21 at 18:26
@Noah: The writers and readers use memory barriers to ensure that doesn't happen. The initial write making the counter odd needs to be visible before any changes to the payload. This is a challenge to express in C++ since you can't do release stores on non-atomic variables, and `atomic_thread_fence(mo_release)` doesn't formally guarantee ordering wrt. later non-atomic stores, only in terms of syncs-with. Of course the whole SeqLock idea is UB, doing UB non-atomic reads and discarding if tearing might have happened, but it also depends on real-world ISA memory models aka implementation details — Peter Cordes, Nov 25 '21 at 21:59
I see, on x86 can you get away with the writer just using an `sfence` after first CAS increment of counter (making it odd) and after writing the payload? Should be impossible then for reader to see changed payload w.o odd counter or writer done (even + incremented) w.o changed payload. — Noah, Nov 25 '21 at 22:12
@Noah: huh? We're not using NT stores so `sfence` is irrelevant; x86-TSO already enforces StoreStore ordering so this is trivial on x86, just compile-time ordering is sufficient. Even with a single-writer SeqLock using just `mov` stores, later `mov` stores are guaranteed to be visible not-before the first store. Or in the multi-writer case, `lock bts` is already a full barrier on x86. The read side needs LoadLoad ordering between the first load (so mo_acquire is sufficient), and between the payload and the final load (`thread_fence(acquire)` which is a no-op on x86, or some `dsb` on ARM) — Peter Cordes, Nov 26 '21 at 01:44

Can/should non-lock-free atomics be implemented with a SeqLock?

1 Answers1

How to use the sequence number as a spinlock