6

I get the feeling this may be a very general and common situation for which a well-known no-lock solution exists.

In a nutshell, I'm hoping there's an approach like a readers/writer lock, but one that doesn't require the readers to acquire a lock and thus can give better average performance.

Instead there'd be some atomic operations (128-bit CAS) for a reader, and a mutex for a writer. I'd keep two copies of the data structure: a read-only one for the normally-successful queries, and an identical copy to be updated under mutex protection. Once the data has been inserted into the writable copy, we make it the new readable copy. The old readable copy then gets the same insertion in turn, once all the pending readers have finished reading it: the writer spins on the number of readers left until it's zero, then modifies the old copy in turn, and finally releases the mutex.

Or something like that.

Anything along these lines exist?

Swiss Frank
  • [`std::atomic`](https://en.cppreference.com/w/cpp/atomic/atomic) – Mooing Duck Apr 15 '20 at 20:07
  • See also: [`boost::lockfree::queue`](https://www.boost.org/doc/libs/1_54_0/doc/html/boost/lockfree/queue.html) – Mooing Duck Apr 15 '20 at 20:08
  • yes, I'm a big novice at them but looking at that. It's all simple until the writer has to add the new record to the previously read-only copy of the tree. You can't until the readers are done reading, but making sure you're counting every reader that has a copy of the read-only tree requires (I think) having a reader-count field and pointer-to-read-only-copy right next to each other, and updating them with a 128-bit atomic update. The problem is that with that many atomic operations, plus spinning, plus still a write mutex, it may be slower than a readers-writer lock. – Swiss Frank Apr 15 '20 at 20:12
  • I know a lock-free queue but I don't see the applicability. I don't see how to look at this as a queue. – Swiss Frank Apr 15 '20 at 20:13
  • You have some threads writing data in and some threads reading data out; it sounds like you might be able to use a queue with a little work – Mooing Duck Apr 15 '20 at 20:16
  • The normal advantage of CAS is that it avoids spinning. If you have spinning, you're probably misusing CAS and should rethink what you're doing. – Mooing Duck Apr 15 '20 at 20:17
  • Actually, looking closer at what you have, I suspect you're using double-buffering, rather than a queue. If you have memory to spare, I've found it's often easier to use triple buffering. One writer thread has a buffer it mutates, and then uses CAS to swap it into the handoff-area, and then starts mutating the buffer it just removed. The reader thread starts with an empty buffer, uses CAS to swap with the one in the handoff-area, and reads out the data. At any time, there's three buffers, and neither thread is waiting. (If one thread gets too far ahead, you _might_ have to wait) – Mooing Duck Apr 15 '20 at 20:19
  • This sounds a lot like [left-right concurrency control](http://concurrencyfreaks.blogspot.com/2013/12/left-right-concurrency-control.html). It's wait-free for readers but writes are blocking with respect to other writers. – Eric Apr 15 '20 at 20:26
  • @Eric EXACTLY what I needed, thx! – Swiss Frank Jun 28 '20 at 16:06

4 Answers

6

If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic<my_struct>.
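
For example, a minimal sketch of that case, assuming a hypothetical two-field struct that packs into 64 bits (the type and names are illustrative, not from the question):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 64-bit payload: two 32-bit fields, trivially copyable.
struct Pair {
    uint32_t key;
    uint32_t value;
};
static_assert(sizeof(Pair) == 8, "fits in one 64-bit atomic object");

std::atomic<Pair> shared_pair{Pair{0, 0}};   // lock-free on typical 64-bit targets

void publish(uint32_t k, uint32_t v) {
    shared_pair.store(Pair{k, v}, std::memory_order_release);   // one plain store
}

Pair snapshot() {
    return shared_pair.load(std::memory_order_acquire);         // one plain load, no RMW
}
```

Checking `std::atomic<Pair>::is_always_lock_free` (or `is_lock_free()` at run time) confirms whether the target actually does this without a lock.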

For smallish and/or infrequently-written data, there are a couple of ways to make readers truly read-only on the shared data, not having to do any atomic RMW operations on a shared counter or anything. This allows read-side scaling to many threads without readers contending with each other (unlike a 128-bit atomic read on x86 using lock cmpxchg16b (see note 1 below), or taking an RWlock).

Ideally just an extra level of indirection via an atomic<T*> pointer (RCU), or just an extra load + compare-and-branch (SeqLock); no atomic RMWs or memory barriers stronger than acq/rel or anything else in the read side.

This can be appropriate for data that's read very frequently by many threads, e.g. a timestamp updated by a timer interrupt but read all over the place. Or a config setting that typically never changes.

If your data is larger and/or changes more frequently, one of the strategies suggested in other answers, where a reader still takes an RWlock on something or atomically increments a counter, will be more appropriate. This won't scale perfectly because each reader still needs to get exclusive ownership of the shared cache line containing the lock or counter so it can modify it, but there's no such thing as a free lunch.

Note 1: Update: x86 vendors finally decided to guarantee that 128-bit SSE/AVX loads / stores are atomic on CPUs with AVX, so if you're lucky std::atomic<16-byte-struct> has cheap loads when running on a CPU with AVX enabled. e.g. not Pentium/Celeron before Ice Lake. GCC for a while has been indirecting to a libgcc atomic_load_16 function for 16-byte operations, so runtime dispatching for it can pick a lock cmpxchg16b strategy on CPUs that support it. Now it has a much better option to choose from on some CPUs.


RCU

It sounds like you're half-way to inventing RCU (Read Copy Update) where you update a pointer to the new version.

But remember a lock-free reader might stall after loading the pointer, so you have a deallocation problem. This is the hard part of RCU. In a kernel it can be solved by having sync points where you know that there are no readers older than some time t, and thus can free old versions. There are some user-space implementations; see https://en.wikipedia.org/wiki/Read-copy-update and https://lwn.net/Articles/262464/.

For RCU, the less frequent the changes, the larger a data structure you can justify copying. e.g. even a moderate-sized tree could be doable if it's only ever changed interactively by an admin, while readers are running on dozens of cores all checking something in parallel. e.g. kernel config settings are one thing where RCU is great in Linux.
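
To make the publish-by-pointer-swap idea concrete, here is a minimal sketch assuming a small, rarely-changed config-style structure (all names are mine). It deliberately never frees old copies, because safe reclamation is exactly the hard part described above; a real implementation would defer deletion until a grace period has passed, or fall back to shared_ptr / hazard pointers and accept the extra read-side cost:

```cpp
#include <atomic>
#include <memory>
#include <string>

struct Config {
    int max_connections;
    std::string log_path;
};

std::atomic<const Config*> g_config{new Config{100, "/var/log/app.log"}};

// Reader: one acquire load, no atomic RMW, no shared counter touched.
int current_max_connections() {
    const Config* c = g_config.load(std::memory_order_acquire);
    return c->max_connections;
}

// Writer (serialize writers with a mutex if there can be more than one):
// copy, modify the copy, publish with a single pointer store.
void set_max_connections(int n) {
    const Config* old = g_config.load(std::memory_order_relaxed);
    auto* fresh = new Config(*old);                       // copy
    fresh->max_connections = n;                           // update
    g_config.store(fresh, std::memory_order_release);     // publish
    // 'old' is intentionally leaked here; freeing it safely needs RCU-style
    // grace periods, since a stalled reader may still be dereferencing it.
}
```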


SeqLock

If your data is small (e.g. a 64-bit timestamp on a 32-bit machine), another good option is a SeqLock. Readers check a sequence counter before and after a non-atomic copy of the data into a private buffer. If the sequence counters match, we know there wasn't tearing. (Writers mutually exclude each other with a separate mutex.) See Implementing 64 bit atomic counter with 32 bit atomics and How to implement a seqlock lock using c++11 atomic library.

It's a bit of a hack in C++ to write something that can compile efficiently to a non-atomic copy that might have tearing, because inevitably that's data-race UB. (Unless you use std::atomic<long> with mo_relaxed for each chunk separately, but then you're defeating the compiler from using movdqu or something to copy 16 bytes at once.)

A SeqLock makes the reader copy the whole thing (or ideally just load it into registers) every read so it's only ever appropriate for a small struct or 128-bit integer or something. But for less than 64 bytes of data it can be quite good, better than having readers use lock cmpxchg16b for a 128-bit datum if you have many readers and infrequent writes.

It's not lock-free, though: a writer that sleeps while modifying the SeqLock could get readers stuck retrying indefinitely. For a small SeqLock the window is small, and obviously you want to have all the data ready before you do the first sequence-counter update to minimize the chance for an interrupt pausing the writer in mid update.

The best case is when there's only 1 writer so it doesn't have to do any locking; it knows nothing else will be modifying the sequence counter.
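
As a rough illustration, here is a single-writer SeqLock sketch along those lines, using relaxed per-field atomics for the payload to sidestep the data-race-UB issue mentioned above (the struct and names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

struct Sample { uint64_t a; uint64_t b; };   // hypothetical small payload

std::atomic<uint32_t> seq{0};                // even = stable, odd = write in progress
std::atomic<uint64_t> data_a{0}, data_b{0};  // payload, stored field by field

void seqlock_write(Sample s) {               // single writer: no writer mutex needed
    uint32_t s0 = seq.load(std::memory_order_relaxed);
    seq.store(s0 + 1, std::memory_order_relaxed);          // go odd
    std::atomic_thread_fence(std::memory_order_release);   // data stores stay after it
    data_a.store(s.a, std::memory_order_relaxed);
    data_b.store(s.b, std::memory_order_relaxed);
    seq.store(s0 + 2, std::memory_order_release);          // back to even
}

Sample seqlock_read() {
    for (;;) {
        uint32_t s0 = seq.load(std::memory_order_acquire);
        if (s0 & 1) continue;                                // writer mid-update; retry
        Sample s{ data_a.load(std::memory_order_relaxed),
                  data_b.load(std::memory_order_relaxed) };
        std::atomic_thread_fence(std::memory_order_acquire); // loads stay before recheck
        if (seq.load(std::memory_order_relaxed) == s0)
            return s;                                        // counter unchanged: no tearing
    }
}
```

Note the reader never writes shared memory, which is why many readers scale without contending with each other.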

Peter Cordes
  • "remember a lock-free reader might stall after loading the pointer" Yes, I follow you. The writer updates the writable version, puts it in place to be found as the next-generation read-only, other readers can start reading it; writer has to wait for readers to drain from the old read-only before updating it and one reader is blocked for minutes on end? Writer also blocked... (OTOH that's the case if a writer's waiting for a mutex held by another writer who goes to sleep, isn't it? And my readers become writers to add data.) See my answer below if you have time; I'd love your thoughts. – Swiss Frank Apr 16 '20 at 04:56
  • btw what is "tearing?" – Swiss Frank Apr 16 '20 at 04:56
  • My idea above was bad. But it was: pointer-to-RO-copy and count-of-readers are side by side. If they were 0x1234 and 7, a reader would try to CAS to 0x1234 and 8. If they succeed, they can use the pointer knowing they're accurately reflected in the reader count. Once a writer has the next-generation RO copy ready they store it in the other slot of a 2-element array with 0 readers. Then, they NULL out the old RO pointer, causing the reader CAS to fail (pointer no longer 0x1234.) When readers finish they decrement reader count, which writer is spinning on. At 0, writer can update old RO. – Swiss Frank Apr 16 '20 at 05:01
  • @SwissFrank: tearing is what happens when a load or store *isn't* atomic. e.g. a producer storing `0xFFFF` and `0x0000` alternating, a torn read could be `0xFF00` or `0x00FF` (or of course a wider version of that). A visual analogy version of the same effect is what happens when you play a game without vsync. – Peter Cordes Apr 16 '20 at 05:17
  • Thx Peter, I know the effect, just couldn't place the term for a second. BTW **do** we ever see tearing on <= 8byte data, that's 8-byte aligned, on x86, or do we use atomic<> just to document we need that, and be portable at some point to some microcontroller with a smaller data bus or something? – Swiss Frank Apr 16 '20 at 05:30
  • @SwissFrank: [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881) - `atomic` (even with `mo_relaxed`) is necessary to make sure loads/stores actually happen to memory, instead of keeping a copy of the C object in a register, basically what `volatile` does ([When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - basically never). Assuming a sane compiler and a 64-bit ABI with `alignof(int64_t)=8`, you wouldn't see tearing, just every other problem. – Peter Cordes Apr 16 '20 at 05:59
  • You *could* see a value that's simultaneously 0 and -1 if the compiler chooses to reload for some reason (https://lwn.net/Articles/793253/#Invented%20Loads); it can assume that's safe because data-race UB is UB, and compilers can assume UB never happens (equivalently, it doesn't matter what the machine code does for executions that encounter UB). To implement a SeqLock you usually need hacks like `volatile` or memory barriers like `asm("" ::: "memory")` to make sure loads really happen if you're trying to avoid making the protected data be `atomic` or an array of `atomic`. – Peter Cordes Apr 16 '20 at 05:59
  • OK Peter, I've reviewed RCU and if I understand it right, I do something quite similar (albeit with as many old copies as the consumers need) for a different part of the app. But for this part, the data structure I'm protecting is a tree structure with say 10,000 nodes. I can't just create copies on the fly then delete the old read-only copy after I get rid of it. Instead I was planning on something very much like Left-Right (which I've pseudocoded below). – Swiss Frank Apr 16 '20 at 17:45
  • @SwissFrank: ok, that wasn't clear from the question. For your case, SeqLock is clearly not viable. And RCU is only appropriate for such large data if writes are *very* infrequent, and it's worth the costs to make parallel reads maximally cheap and scalable with many parallel readers. Updated my answer to clarify what SeqLock and RCU are useful for and when they're not. Hopefully this answer is useful for some future readers whose question matches this title but where details are different. – Peter Cordes Apr 16 '20 at 18:12
  • "If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic" In fact all but about 4 Pentium-alikes from AMD from like 2000-2001 can do 128-bit CAS. A bug in one g++ (7?) will NOT issue that instruction even if you use compile flags to target newer CPUs explicitly; instead you get a mutex and performance is 30x slower. – Swiss Frank Apr 18 '20 at 02:25
  • @SwissFrank: I said *cheaply*, which excludes reading via an atomic RMW like `lock cmpxchg16b`. You'd still want a SeqLock for 128-bit data on modern x86-64 if it's infrequently written and read in parallel by many cores. In "most" I was including non-x86 including still widespread 32-bit ARM (where modern ARM can use `ldp` on a pair of 32-bit registers). Also you're being a bit generous with your dates for x86: [AMD64 silicon was new in late 2003](https://en.wikipedia.org/wiki/AMD_K8) and didn't support CX16 (hence it not being baseline for x86-64), and Intel Nocona P4 then Core2 came later. – Peter Cordes Apr 18 '20 at 02:41
  • @SwissFrank: GCC7 and later never inline `lock cmpxchg16b` for `__atomic` builtins even with `-mcx16`; this was an intentional change ([Genuinely test std::atomic is lock-free or not](https://stackoverflow.com/a/49863341) has links). Dynamically-linked `libatomic` still uses `lock cmpxchg16b` on CPUs that are capable of it. Did you statically link so it picked the safe fallback that uses a mutex? That could be why it was 30x slower. Otherwise if you have a real-world use-case where GCC was 30x slower, maybe devs would be willing to change back. – Peter Cordes Apr 18 '20 at 02:54
  • Many thanks and I'll follow up your leads to self-study a bit. This particular project will only be server-grade Intel newer than 2010 or something but who knows about my next job or project... – Swiss Frank Apr 18 '20 at 02:57
  • @SwissFrank: Oops, linked the wrong answer on [Genuinely test std::atomic is lock-free or not](https://stackoverflow.com/a/49848879). I meant mine, and the actual GCC7 `__atomic` builtin change was first proposed / explained on GCC's mailing list: https://gcc.gnu.org/legacy-ml/gcc-patches/2017-01/msg02344.html. The intent was to not report it as `lock_free`, but still end up using the same `lock cmpxchg16b` inside a library function for atomic ops. Passing args to an actual CAS function may be less good and create more store/reloads that slow down the uncontended case... – Peter Cordes Apr 18 '20 at 03:01
3

What you're describing is very similar to double instance locking and left-right concurrency control.

In terms of progress guarantees, the difference between the two is that the former is lock-free for readers while the latter is wait-free. Both are blocking for writers.
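
For reference, a rough sketch of the double-instance idea, assuming C++17 std::shared_mutex (the class and member names are mine): readers try the shared lock on either instance until one succeeds, while a writer exclusively locks only one instance at a time, so at least one instance stays readable.

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

class DoubleInstanceMap {
    std::map<std::string, int> instance_[2];
    std::shared_mutex rw_[2];      // one reader-writer lock per instance
    std::mutex writers_;           // serializes writers

public:
    // Reader: the writer holds at most one exclusive lock at a time, so
    // looping over try_lock_shared on both instances soon succeeds.
    std::optional<int> read(const std::string& key) {
        for (;;) {
            for (int i = 0; i < 2; ++i) {
                if (rw_[i].try_lock_shared()) {
                    std::optional<int> r;
                    auto it = instance_[i].find(key);
                    if (it != instance_[i].end()) r = it->second;
                    rw_[i].unlock_shared();
                    return r;
                }
            }
        }
    }

    // Writer: apply the same update to each instance in turn, holding the
    // exclusive lock on only one of them at any moment.
    void write(const std::string& key, int value) {
        std::lock_guard<std::mutex> g(writers_);
        for (int i = 0; i < 2; ++i) {
            std::unique_lock<std::shared_mutex> w(rw_[i]);
            instance_[i][key] = value;
        }
    }
};
```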

Eric
  • Note that double-instance locking is lock-free for readers as far as progress guarantees go (at least one of the RW locks will be available at any given moment), but a reader acquiring an RW lock still has to do an atomic RMW operation on something (I think?). So readers still effectively contend with each other, unlike with a SeqLock or RCU. But this is viable for data that changes more frequently and/or is larger than would be good for either of those, at the cost of imperfect read-side scalability. – Peter Cordes Apr 15 '20 at 21:39
  • Reading the first paragraph of Left-Right's section 2, Design Overview, it is EXACTLY what I was thinking. Literally every word applies to my idea. I'm keen to finish this paper to see how they implement it. I'll give my own solution as an answer in pseudocode. Thanks for these links! – Swiss Frank Apr 16 '20 at 08:02
  • @SwissFrank FYI: my [xenium library](https://github.com/mpoeter/xenium) provides implementations for `left_right` and `seqlock`. It is actually possible to make the read operation of a seqlock also lock-free (reads don't even have to be announced), but a seqlock can only handle trivially copyable types. – mpoeter Apr 16 '20 at 10:13
  • Having finished the Left-Right journal article, it seems identical to the algo I had in mind (which I went ahead and show as pseudocode below), with the exception that I have to spin a little as reader to both get the read-only copy and also increment its reader count. I don't quite understand the Left-Right's alternative for managing the count yet so must study it more. – Swiss Frank Apr 16 '20 at 17:40
  • @mpoeter thx, I see you have the Michael Scott queue as well. I implemented that but the compiler I was on that day (g++ 7?) has kind of a bug where it refuses to use the 128-bit machine language instruction for 128-bit CAS, instead using a mutex, so the queue was not only not no-lock but also 40x slower than my alternative. I ended up writing something I call the PoolQueue, a single lock MWSR queue, that uses the queue linked list to also provide storage pool for heterogeneous-sized queue items. Memory management for the messages is also a performance issue, so kill two birds one stone. – Swiss Frank Apr 16 '20 at 17:58
  • @SwissFrank my implementation of the Michael-Scott queue uses only simple single-pointer CAS. Instead it relies on one of the provided memory reclamation schemes to avoid the ABA problem. – mpoeter Apr 16 '20 at 20:44
1

It turns out the two-structure solution I was thinking of has similarities to http://concurrencyfreaks.blogspot.com/2013/12/left-right-concurrency-control.html

Here's the specific data structure and pseudocode I had in mind.

We have two copies of some arbitrary data structure called MyMap allocated, and two of a group of three pointers point to these copies. Initially, one is pointed to by achReadOnly[0].pmap and the other by pmapMutable.

A quick note on achReadOnly: it has a normal state and two temporary states. The normal state will be (WLOG for cell 0/1):

achReadOnly = { { pointer to one data structure, number of current readers },
                { nullptr, 0 } }
pmapMutable = pointer to the other data structure

When we've finished mutating "the other," we store it in the unused slot of the array as it is the next-generation read-only and it's fine for readers to start accessing it.

achReadOnly = { { pointer to one data structure, number of old readers },
                { pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the other data structure

The writer then clears the pointer to "the one," the previous-generation read-only, forcing readers to go to the next-generation one, and moves that previous-generation pointer to pmapMutable.

achReadOnly = { { nullptr, number of old readers },
                { pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the one data structure

The writer then spins for the number of old readers to hit 1 (itself), at which point that copy can receive the same update. The 1 is then overwritten with 0 to clean up in preparation for moving forward, though in fact it could be left dirty, as it won't be referred to before being overwritten.

struct CountedHandle {
    MyMap*   pmap;
    int      iReaders;
};

// Data Structure:
atomic<CountedHandle> achReadOnly[2];
MyMap* pmapMutable;
mutex_t muxMutable;

data Read( key ) {
    int iWhich = 0;
    CountedHandle chNow, chUpdate;

    // Spin if necessary to update the reader counter on a pmap, and/or
    // to find a pmap (as the pointer will be overwritten with nullptr once
    // a writer has finished updating the mutable copy and made it the next-
    // generation read-only in the other slot of achReadOnly[].)

    do {
        chNow = achReadOnly[ iWhich ];
        if ( !chNow.pmap ) {
            iWhich = 1 - iWhich;
            continue;
        }
        chUpdate = chNow;
        chUpdate.iReaders++;   // bump the reader count in the value we try to install
    } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );

    // Now we've found a map, AND registered ourselves as a reader of it atomically.
    // Importantly, it is impossible any reader has this pointer but isn't
    // represented in that count.

    if ( data = chNow.pmap->Find( key ) ) {
        // Deregister ourselves as a reader.
        do {
            chNow = achReadOnly[ iWhich ];
            chUpdate = chNow;
            chUpdate.iReaders--;   // decrement in the value we try to install
        } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );

        return data;
    }

    // OK, we have to add it to the structure.

    lock muxMutable;
    figure out data for this key
    pmapMutable->Add( key, data );

    // It's now the next-generation read-only.  Put it where readers can find it.
    achReadOnly[ 1 - iWhich ].pmap = pmapMutable;

    // Prev-generation readonly is our Mutable now, though we can't change it
    // until the readers are gone.
    pmapMutable = achReadOnly[ iWhich ].pmap;

    // Force readers to look for the next-generation readonly.
    achReadOnly[ iWhich ].pmap = nullptr;

    // Spin until all readers finish with previous-generation readonly.
    // Remember we added ourselves as reader so wait for 1, not 0.

    while ( achReadOnly[ iWhich ].iReaders > 1 )
        ;

    // Remove our reader count.
    achReadOnly[ iWhich ].iReaders = 0;

    // No more readers for previous-generation readonly, so we can now write to it.
    pmapMutable->Add( key, data );

    unlock muxMutable;

    return data;

}
Swiss Frank
0

Solution that has come to me:

Every thread has a thread_local copy of the data structure, and this can be queried at will without locks. Any time you find your data, great, you're done.

If you do NOT find your data, then you acquire a mutex for the master copy.

This will have potentially many new insertions in it from other threads (possibly including the data you need!). Check to see if it has your data and if not insert it.

Finally, copy all the recent updates--including the entry for the data you need--to your own thread_local copy. Release mutex and done.

Readers can read all day long, in parallel, even when updates are happening, without locks. A lock is only needed when writing (or sometimes when catching up). This general approach would work for a wide range of underlying data structures. QED


Having many thread_local indexes sounds memory-inefficient if you have lots of threads using this structure.

However, the data found by the index, if it's read-only, need only have one copy, referred to by many indices. (Luckily, that is my case.)

Also, many threads might not be randomly accessing the full range of entries; maybe some only need a few entries and will very quickly reach a final state where their local copy of the structure can find all the data needed, before it grows much. And yet many other threads may not refer to this at all. (Luckily, that is my case.)

Finally, to "copy all the recent updates" it'd help if all new data added to the structure were, say, pushed onto the end of a vector so given that say you have 4000 entries in your local copy, the master copy has 4020, you can with a few machine cycles locate the 20 objects you need to add. (Luckily, that is my case.)

Swiss Frank