
Please consider the following synchronization problem:

initially:
    version = 0           // atomic variable
    data = 0              // normal variable (there could be many)

Thread A:
    version++
    data = 3

Thread B:
    d = data
    v = version
    assert(d != 3 || v == 1)

Basically, if thread B sees data = 3 then it must also see version++.

What's the weakest memory order and synchronization we must impose so that the assertion in thread B is always satisfied?

If I understand C++ memory_order correctly, the release-acquire ordering won't do because that guarantees that operations BEFORE version++, in thread A, will be seen by the operations AFTER v = version, in thread B.

Acquire and release fences also work in the same directions, but are more general.

As I said, I need the other direction: B sees data = 3 implies B sees version = 1.

I'm using this "versioning approach" to avoid locks as much as possible in a data structure I'm designing. When I see something has changed, I step back, read the new version and try again.

I'm trying to keep my code as portable as possible, but I'm targeting x86-64 CPUs.

Kiuhnm

1 Answer


You might be looking for a SeqLock, as long as your data doesn't include pointers. (If it does, then you might need something more like RCU to protect readers that might load a pointer, stall / sleep for a while, then deref that pointer much later.)

You can use the SeqLock sequence counter as the version number. (version = tmp_counter >> 1 since you need two increments per write of the payload to let readers detect tearing when reading the non-atomic data. And to make sure they see the data that goes with this sequence number. Make sure you don't read the atomic counter a 3rd time; use the local tmp that you read it into to verify match before/after copying data.)

Readers will have to retry if they happen to attempt a read while the data is being modified. But since the data is non-atomic, thread B seeing data = 3 can never be what *creates* synchronization; it can only be something B sees as a result of synchronizing with a version number published by the writer.
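A reader along those lines might look like the following sketch. (`Payload`, `seq`, and `payload` are assumed names, not from your code; as discussed below, the non-atomic payload copy is technically a data race in ISO C++, but works in practice on real hardware.)

```cpp
#include <atomic>
#include <cstdint>

struct Payload { int data; };            // non-atomic payload (could be much larger)

std::atomic<uint32_t> seq{0};            // even = stable, odd = write in progress
Payload payload{0};

// Reader: retry until we get a copy bracketed by a matching even sequence number.
Payload read_payload(uint32_t &version_out) {
    while (true) {
        uint32_t tmp = seq.load(std::memory_order_acquire);
        if (tmp & 1) continue;                        // writer is mid-update: retry
        Payload copy = payload;                       // non-atomic copy (possibly torn)
        std::atomic_thread_fence(std::memory_order_acquire);
        if (tmp == seq.load(std::memory_order_relaxed)) {
            version_out = tmp >> 1;                   // version = counter >> 1
            return copy;                              // consistent snapshot
        }
        // counter changed while copying: data may be torn, retry
    }
}
```

Note the counter is read into a local `tmp` exactly twice per attempt; the copy is only trusted if both reads match and the value is even.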

See:

  • Implementing 64 bit atomic counter with 32 bit atomics - my attempt at a SeqLock in C++, with lots of comments. It's a bit of a hack because ISO C++'s data-race UB rules are overly strict; a SeqLock relies on detecting possible tearing and not using torn data, rather than avoiding concurrent access entirely. That's fine on a machine without hardware race detection (i.e. all real CPUs), where nothing will fault, but C++ still calls it UB, even with volatile (although that puts it more into implementation-defined territory). In practice it's fine.

  • GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? - A GCC bug fixed in 8.1 that could break a seqlock implementation.

If you have multiple writers, you can use the sequence-counter itself as a spinlock for mutual exclusion between writers. e.g. using an atomic_fetch_or or CAS to attempt to set the low bit to make the counter odd. (tmp = seq.fetch_or(1, std::memory_order_acq_rel);, hopefully compiling to x86 lock bts). If it previously didn't have the low bit set, this writer won the race, but if it did then you have to try again.
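A sketch of that multi-writer scheme (again with assumed names `seq` and `data_field`); the acquire half of the `acq_rel` RMW keeps the payload stores from reordering before the odd store:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> seq{0};   // even = stable, odd = a writer holds the "lock"
int data_field = 0;             // example non-atomic payload field

void write_payload(int new_value) {
    uint32_t tmp;
    while (true) {
        // Atomically set the low bit; hopefully compiles to x86 lock bts.
        tmp = seq.fetch_or(1, std::memory_order_acq_rel);
        if (!(tmp & 1)) break;  // low bit was clear: this writer won the race
        // else another writer is mid-update: spin and retry
    }
    // Counter is now tmp|1 (odd); readers that see it will retry.
    data_field = new_value;                         // non-atomic payload update
    seq.store(tmp + 2, std::memory_order_release);  // even again: publish
}
```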

But with only a single writer, you don't need to RMW the atomic sequence counter, just store new values (ordered with writes to the payload), so you can either keep a local copy of it, or just do a relaxed load of it, and store tmp+1 and tmp+2.
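With a single writer, that could look like this sketch (assumed names again; the release fence after the relaxed odd store is explained in the comments below - a plain release store would only order in the wrong direction):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> seq{0};
int data_field = 0;
uint32_t local_seq = 0;   // the single writer can keep a private copy of the counter

void write_payload(int new_value) {
    seq.store(local_seq + 1, std::memory_order_relaxed);  // odd: update in progress
    std::atomic_thread_fence(std::memory_order_release);  // payload stores can't be
                                                          // visible before the odd store
    data_field = new_value;                               // non-atomic payload update
    seq.store(local_seq + 2, std::memory_order_release);  // even: publish
    local_seq += 2;
}
```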

Peter Cordes
  • Thanks. Indeed, I reinvented SeqLock (although in a slightly more general form), but my analysis was flawed because I forgot that if I read a version with the lock bit set, then I'll retry anyway, so what matters is what happens when I read the result of the second update to the version. Since the update to the data comes before it, everything works out perfectly. – Kiuhnm Mar 28 '22 at 10:48
  • I have a lingering doubt. Since the synchronization with the second counter update is the only one that matters, can't we relax the first update? – Kiuhnm Mar 28 '22 at 11:11
  • @Kiuhnm: Not exactly. You should use `mo_relaxed` for it, but you need a `std::atomic_thread_fence(mo_release)` after it to make sure none of the updates to `data` can be visible before the first store to the sequence number, the one that makes it odd and lets all readers know that we're mid-update. (Just a `seq.store(val, release)` is only one-way ordering in the wrong direction. Same for `seq.store(val, seq_cst)` - AArch64 can implement that with `stlr` which only has one-way ordering except wrt. later `ldar`.) – Peter Cordes Mar 28 '22 at 20:17
  • Technically I'm not totally sure that a release *fence* guarantees ordering between an atomic store and later non-atomic stores; that may be an implementation detail. But as I said, a SeqLock already relies on data-race UB, unless you make the payload `atomic` with `mo_relaxed` but that would defeat a compiler's ability to copy it efficiently with SIMD, for example. This is all stuff that I think I mentioned in [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) - if I didn't, let me know; been a while since I wrote it. – Peter Cordes Mar 28 '22 at 20:19
  • I see a contradiction in what you say. A release fence makes sure that stores BEFORE it are visible, not after. Otherwise, that would solve my initial question: if thread B sees "data=3" then it also sees "version=1" because a fence would make sure that data=3 is not visible before "version=1". But we know that that's FALSE. That's not how a release fence works. – Kiuhnm Mar 28 '22 at 22:52
  • @Kiuhnm: Stand-alone fences *are* 2-way barriers for stores, blocking all StoreStore reordering between earlier stores and later stores. https://preshing.com/20130922/acquire-and-release-fences/ / https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ (They also block LoadStore reordering, which is asymmetric unlike a seq_cst barrier.) – Peter Cordes Mar 28 '22 at 22:54
  • Not according to https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence. See the definitions of how they work and what they guarantee. Note that release fences must come BEFORE the atomic stores and acquire fences AFTER the atomic loads. The only reason to use a fence instead of an operation with the same memory order, is if you have multiple operations and you don't want to pay the price multiple times. edit: Also if you don't always need to impose the order (if). See the example in the notes. – Kiuhnm Mar 28 '22 at 23:07
  • Basically, the scheme is this: OpsX ] RelFence AtomicStore --- AtomicLoad AcqFence [ OpsY. IF the AtomicLoad sees the result of the AtomicStore, THEN the two fences guarantee that the results of OpsX are seen by OpsY. Nothing more than that, according to the definitions. I think you're relying on implementation details and I'm not so sure anymore that SeqLock is guaranteed to work by only relying on the official guarantee. We need something more. – Kiuhnm Mar 28 '22 at 23:18
  • If modifications to the data are reordered BEFORE the first counter++, then everything breaks down, because the other thread could read the NEW data and twice the OLD counter. – Kiuhnm Mar 28 '22 at 23:23
  • @Kiuhnm: That's how you use fences to create a syncs-with relationship between relaxed load/store, i.e. to emulate a release store. But that's not what we're doing with them in this case. cppreference used to have too weak a statement of what fences do, but currently does say "*While an atomic store-release operation prevents all preceding reads and writes from moving past the store-release, an `atomic_thread_fence` with `memory_order_release` ordering prevents **all preceding reads and writes** from moving past **all subsequent stores**"* Like Preshing said. – Peter Cordes Mar 29 '22 at 02:51
  • @Kiuhnm: If you insist on looking at it as acq/rel synchronization instead of how CPUs actually work, you could look at it in terms of acq/rel synchronizing on the non-atomic `data`. As I've explained multiple times, a SeqLock doesn't fully avoid UB unless you also make each separate `data` item `std::atomic` (and use mo_relaxed). (Which defeats optimizations for copying larger data items). But in asm for real CPUs, a non-atomic load has to compile to the same kind of load as atomic with relaxed, so barriers can create order. It's not guaranteed by ISO C++, but does work everywhere. – Peter Cordes Mar 29 '22 at 02:57
  • "prevents **all preceding reads and writes from moving past all subsequent stores**". If that wasn't the case, fences would be useless, as I noted before. But as I said before, **that's the wrong direction**. We want a fence that stop ops that comes **after** it from being executed **before** it, and that's **not** how the fences work. As I said, it works in practice because you're relying on implementation details. The other contradiction, as I said, is that if it works for real CPUs, then my question can be solved with a fence in Thread A. – Kiuhnm Apr 01 '22 at 22:19
  • The problem is not UB. I'll repeat what I said before "If modifications to the data are reordered BEFORE the first counter++, then everything breaks down, because the other thread could read the NEW data and twice the OLD counter." UB or not, the modification to the data will **not** be detected. – Kiuhnm Apr 01 '22 at 22:24
  • FACTS: (1) if fences work according to the cppreference, the current implementation of SeqLock is **not** guaranteed to work. (2) if fences, on the contrary, offer more guarantees, then SeqLock *does* work, and the problem in my question can be solved by adding a fence between `version++` and `data = 3`. Any way you slice it, there's a contradiction in what you say. If you don't agree, let's agree to disagree. – Kiuhnm Apr 01 '22 at 22:31
  • @Kiuhnm: In your question, if the thread A code only ever ran once, then yes I think you could fix it in practice with a fence. But if it could run repeatedly in that thread, you couldn't tell when a payload value might have come from mid update. That's the point of using a SeqLock. In the writer, we need to do `seq=tmp+1;` `fence(release);` `payload = whatever;` `seq.store(tmp+2, release);`. We need `fence(release)` to prevent any payload stores from being visible before the 1st seq store. That's the same thing as stopping the seq store from being visible only after the payload store – Peter Cordes Apr 01 '22 at 23:00
  • @Kiuhnm: In other words, that's not the wrong direction, that's *both* directions for stores. No stores before the barrier can reorder with any stores after the barrier, which implies the converse, too. We don't care about the writer's loads being reordered in one direction if it's the only writer. It doesn't matter what order its loads happen in, because it's only reading its own stores. In the reader, we have something like `do { tmp = seq.load(acquire); val = payload; fence(acquire); }while( (tmp & 1) || tmp != seq.load(relaxed));`. An acq fence blocks loads in both dirs. – Peter Cordes Apr 01 '22 at 23:01
  • @Kiuhnm: In a successful execution, `tmp = seq.load(acquire)` syncs-with `seq.store(tmp+2, release)`. If seq and payload were both atomic using `mo_seq_cst`, then C++ would guarantee that seeing the same value in `seq` afterwards would mean a new write of `payload` hadn't happened either, because of the total order across all seq_cst operations. We're weakening that saying that if you see a torn value for `payload`, that's like an acquire load that guarantees you'll also see an updated value for `seq` afterwards, since store/relfence/store is like a release store but stronger (but non-atomic) – Peter Cordes Apr 01 '22 at 23:14