
When using one-sided RDMA into modern memory, lock-free, the question arises of how a remote reader can safely view its incoming data when the data objects span multiple cache lines.

In the Derecho open-source multicast and replicated logging library (at https://GitHub.com/Derecho-Project) we have this pattern. A writer W is granted permission to write to a range of memory in a reader, R. The memory is properly pinned and mapped. Now, suppose that the write involves some sort of vector of data spanning many cache lines, which is common. We use a guard: a counter (also in RDMA-accessible memory, but in some other cache line) that gets incremented. R spins, watching the counter… when it sees a change, this tells R “you have a new message”, and R then reads the data in the vector. Later we have a second pattern whereby R says to W, “I am done with that message, you can send another one.”
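To make the pattern concrete, here is a minimal sketch of the layout, with invented names (the real Derecho buffers are pinned and registered through the RDMA verbs API, which is omitted here; only the shape matters):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout, names invented for illustration.
struct RingSlot {
    alignas(64) char payload[4096];               // the "vector": spans many cache lines
};

struct Ring {
    RingSlot slots[16];                           // round-robin circular buffer
    alignas(64) std::atomic<uint64_t> guard;      // W -> R: "message n has arrived"
    alignas(64) std::atomic<uint64_t> completed;  // R -> W: "done through message n"
};
```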

My question: With modern memory models, which flavor of C++ atomic should be used for the memory into which the vector will be written? Would this be denoted as relaxed consistency? I want my code to work on ARM and AMD, not just Intel with its strong TSO memory model.

Then for my counter, when R spins watching for the counter update, how do I want the counter declared? Would it need to be declared as an acquire-release atomic?

Finally, is there any merit, in terms of speed or correctness, to declaring everything as relaxed, but then issuing a memory-order fence (std::atomic_thread_fence) after R observes the counter to have been incremented? My thinking is that with this second approach, I use a minimum consistency model on all the RDMA memory (and the same model for all such memory), and I only need to invoke the more costly fence after the counter is observed to increment. So it happens just once, prior to accessing my vector, whereas the acquire-release atomic counter would trigger a memory-fencing mechanism every time my polling thread loops. To me this sounds hugely expensive.
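To make the comparison concrete, here is a hedged sketch of both reader-side variants against the hypothetical guard above, assuming (as the pattern requires) that the payload writes become visible before the counter update. Variant A pays acquire ordering on every spin; variant B spins with relaxed loads and pays for a single acquire fence only after observing the increment:

```cpp
#include <atomic>
#include <cstdint>

// Variant A: the guard is read with acquire semantics on every spin.
uint64_t wait_acquire(const std::atomic<uint64_t>& guard, uint64_t last_seen) {
    uint64_t g;
    while ((g = guard.load(std::memory_order_acquire)) == last_seen) {
        // spin: each iteration is an acquire load
    }
    return g;  // vector reads after this point are ordered by the acquire load
}

// Variant B: spin with relaxed loads, then issue one acquire fence.
uint64_t wait_relaxed_then_fence(const std::atomic<uint64_t>& guard,
                                 uint64_t last_seen) {
    uint64_t g;
    while ((g = guard.load(std::memory_order_relaxed)) == last_seen) {
        // spin: no ordering cost paid here
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // paid exactly once
    return g;  // vector reads after the fence see the writes that preceded
               // the counter update
}
```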

That last thought leads to one more question: must I also declare this memory as volatile, so that the C++ compiler will realize the data can change under its feet, or does it suffice that the compiler itself can see the std::atomic type declarations? On Intel, with its total store ordering, TSO plus volatile is definitely needed.

[Edit: New information] (I'm trying to attract a bit of help here!)

One option seems to be to declare the RDMA memory region as std::atomic, accessed with relaxed consistency (std::memory_order_relaxed), but then to use a lock every time our predicate-evaluation thread retests the guard (which, being in RDMA memory, would be declared with this same relaxed property). We would retain the C++ volatile annotation.

The reasoning is that with the lock, which has acquire-release semantics, the memory-coherence hardware would be warned that it needs to fence prior updates. The lock itself (the mutex) can be declared local to the predicate thread, and would then live in local DRAM, which is cheap; and since this is not a lock anything contends for, locking it is probably as inexpensive as a test_and_set, and unlocking is just a write of 0. If the predicate is true, our triggered code body runs after the lock was accessed (probably after the lock release), so we establish the ordering needed to ensure that the hardware will fetch the guarded object using actual memory reads. But every cycle through our predicate testing -- every "spin" -- ends up doing a lock acquire/release on every predicate, which causes some slowdown.
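A sketch of this first option, as described (names invented). One caveat worth flagging: in the C++ abstract machine an uncontended thread-local mutex does not by itself synchronize with the remote writer; the hope expressed above is that the barrier instructions the lock emits suffice at the hardware level:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// The mutex is local to the predicate thread and never contended; it is
// here only for the acquire (lock) / release (unlock) fences it implies.
bool guard_changed(std::mutex& local_mtx,
                   const std::atomic<uint64_t>& guard,
                   uint64_t last_seen) {
    std::lock_guard<std::mutex> hold(local_mtx);  // acquire; released at scope exit
    return guard.load(std::memory_order_relaxed) != last_seen;
}
```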

Option two, seemingly with less overhead, also declares the RDMA region as std::atomic with relaxed consistency, but omits the lock and does testing as we do now. Then when a predicate tests true, we would execute an explicit memory fence (std::atomic_thread_fence) with acquire semantics. We get the same barrier, but only pay the cost when predicates evaluate to true, hence less overhead.
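A sketch of option two, shaped as a predicate thread sweeping many guards (names invented); the relaxed loads keep the sweep cheap, and the acquire fence is issued only when a predicate actually fires:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct Predicate {
    const std::atomic<uint64_t>* guard;  // lives in the RDMA region
    uint64_t last_seen;                  // thread-local bookkeeping
};

void sweep(std::vector<Predicate>& preds) {
    for (auto& p : preds) {
        uint64_t g = p.guard->load(std::memory_order_relaxed);  // cheap test
        if (g != p.last_seen) {
            std::atomic_thread_fence(std::memory_order_acquire);  // only on success
            p.last_seen = g;
            // ... triggered code body runs here; the vector reads that
            // follow are ordered after the fence ...
        }
    }
}
```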

But now we run into a question of a different kind. Intel has total store order (TSO), and because every thread does some write-then-read actions, Intel is probably forced to fetch the guard variables from memory as a precaution, lest TSO otherwise be violated. C++ with volatile is sure to include the fetch instruction. But on ARM and AMD, is it possible that the hardware itself might stash some guard variable for a very long time in a hardware register or something, causing extreme delays in our spin-like loop? Not knowing anything about ARM and AMD, this seems like a worry. But perhaps one of you knows a lot more than I do?

Ken Birman
  • That really can't be answered without looking at it case by case. I can honestly answer yes and no to every question because you gave examples for both sides. You probably also need extra library support for RDMA, the normal std::atomics won't tell the remote that you are doing an atomic access. – Goswin von Brederlow May 11 '22 at 12:33
  • *how a remote reader can safely view their incoming data if the data objects span multiple cache lines.* - One interesting technique to allow reading a consistent snapshot is the SeqLock (https://en.wikipedia.org/wiki/Seqlock), which allows a truly read-only reader and write-only writer. (See also [how to implement a seqlock lock using c++11 atomic library](https://stackoverflow.com/q/20342691) / [GCC reordering up across load with `memory_order_seq_cst`. Is this allowed?](https://stackoverflow.com/q/36958372) for some example implementations in C++) – Peter Cordes May 11 '22 at 12:42
  • A SeqLock does need write ordering by the writer and read ordering by the reader, naturally. – Peter Cordes May 11 '22 at 12:43
  • _Later we have a second pattern whereby R says to W, “I am done with that message, you can send another one.”_ is this second pattern using the same memory region? – Yann Droneaud May 11 '22 at 12:47
  • @GoswinvonBrederlow we already have Derecho working and have used it for years, but only on Intel. So the atomics are sort of changing the hardware under our feet. But obviously we do have a lot of support and knowledge of RDMA. The issue is now to adapt to the new hardware. – Ken Birman May 11 '22 at 12:48
  • @PeterCordes sounds very exciting! Does it work with IO? My issue is one that could arise in any system doing DMA IO, even a disk… basically, these modern smart controllers are like an additional core sharing memory, but in a lock free concurrent update pattern. – Ken Birman May 11 '22 at 12:48
  • @YannDroneaud yes, same pattern. There is just a counter from R to W, in a different chunk of one sided RDMA memory hosted on W, where R says “I’ve processed up through message 17”. Seeing this, W can reuse the memory. We use it to support a round robin circular buffer, in the obvious way. But the issue is that with these asynchronous writes into multiple cache lines, in hardware with many layers of hardware caching, stashing, prefetching, etc, you aren’t certain to get memory coherence unless you do it exactly right… – Ken Birman May 11 '22 at 12:51
  • I should maybe add a bit more detail for @GoswinvonBrederlow. Both sides are sharing the same declarations, so in fact R and W both are compiled using std::atomics. The RDMA write is, of course, asynchronous and just happening under R’s feet. But the intent here is for R to signal to its own hardware, it’s own CPU cache coherence hardware, that IO can occur into that memory area. Then the fencing is to ensure that when R has seen the counter tick, it will also see the prior writes to the vector of data in those other cache lines. In principle, just what atomics are for… – Ken Birman May 11 '22 at 12:56
  • I think your use-case cares about seeing every message, not like updating a "current time" value where readers just need *a* consistent snapshot of it, but don't care about seeing every tick, and definitely no flow control. So no, a SeqLock wouldn't be useful. Also *spanning many cache lines* wouldn't be a good use-case either; the reader has to copy the data to private memory in between checks of the sequence number, and you'd want to avoid extra copying. When I commented about a SeqLock, I'd only just started skimming. – Peter Cordes May 11 '22 at 13:05
  • My point was that just because you write a value with std::atomic doesn't mean the data is sent over the network via RDMA properly. I would assume the std::atomic will only protect you from access on the same host and does nothing for remotes. But maybe I just don't know enough about how RDMA interacts with the cache coherency and lock prefix in the hardware. – Goswin von Brederlow May 11 '22 at 13:12
  • On W, I would have used IBV_WR_SEND semantic instead of IBV_WR_RDMA_WRITE to get the notification on R's completion queue. This would allow more than one _“I am done with that message, you can send another one.”_ to be pipelined. And it would solve your coherency issue. – Yann Droneaud May 11 '22 at 13:14
  • Well, RDMA is just another kind of core talking to the same memory coherence layer. In principle everything I’ve asked can even arise with two threads on different cores, using memcpy rather than RDMA. The issue centers on the hardware cache coherence properties and fencing… – Ken Birman May 11 '22 at 13:15
  • Unfortunately, I think that there is no behavior on the W side that can help. As I see it, the entire question is the required but cheapest behavior on the R side. W indeed will know that from its own perspective, the writes have completed. But R is a different core and has its own hierarchy of caches that could contaminate the reads, and the writes may actually still be in flight even when R’s hardware lets the RDMA NIC send the ack back to W. – Ken Birman May 11 '22 at 13:18
  • With IBV_WR_SEND on W, R would have to post a receive work request, and once completion of W write, R would see a completion queue entry on its completion queue. – Yann Droneaud May 11 '22 at 13:23
  • Yes, but this strikes me as a somewhat indirect way to trigger a memory fence. My desire is to directly trigger a memory fence, and to do it at the optimal point in the code, minimizing overheads. – Ken Birman May 11 '22 at 13:27
  • spinning can be seen as an overhead :) – Yann Droneaud May 11 '22 at 13:33
  • Additional criticism of forcing R to post a request and see a completion: this would be hugely expensive! Granted, R wouldn’t have to do this in the polling loop that watches the counter. But the cost of safely seeing the RDMA incoming data would suddenly jump to include creating this verb object, enqueuing it, watching a second queue for it to complete… – Ken Birman May 11 '22 at 13:40
  • @Yann Droneaud, this is a tangent but I agree, spinning to watch just one counter would waste a whole core. In fact our solution quiesces if nothing much is going on, at which point we use an interrupt to wake it up. Additionally, the thread that does this polling is really looping over a bunch of such counters, maybe 20 of them. So in practice, we aren't wasting a whole core by pinning it to a spinning thread... – Ken Birman May 11 '22 at 18:25

1 Answer


Well, there seems to be a lack of expertise on this issue at this time. Probably the newness of the std::atomics options and the general sense of uncertainty about precisely how ARM and AMD will implement relaxed consistency make it hard for people to know the answer, and speculation isn't helpful.

As I'm understanding this, the right answers seem to be:

  1. The entire problem won't be seen on Intel because of its TSO (total store order) policy. With TSO, the guard gets updated after the vector it guards, hence in any total store order the guard was updated last. Seeing the guard change therefore guarantees that the receiver will see the updated vector elements. (AMD's x86 processors implement the same TSO model; ARM, however, is weakly ordered, so this reasoning does not carry over there.)
  2. By explicitly declaring the RDMA memory region to have relaxed consistency (std::memory_order_relaxed), a developer is opting for a cheaper memory model, but taking on the obligation to insert a memory fence. The most obvious way to do this is to just acquire a lock before reading the guard, then release the lock after doing so. This has a cost even if no other thread contends for the lock. First, the lock operation itself requires a few clock cycles. But more broadly, locking a random mutex will have some unknown impact on caches, because the hardware must assume that the lock actually is contended for, a wait may have occurred, and values could have changed under its feet. This will result in a cost that needs to be quantified.
  3. Equivalently, the guard can be declared to use acquire-release consistency: the reader loads it with acquire semantics, and the writer stores it with release semantics. Seemingly, this creates a memory fence, and the prior updates used to write the vector will be visible to any reader who has seen the guard value change. Again, the cost needs to be quantified.
  4. Perhaps one could do a fenced read at the top of the code block triggered by the predicate. This would get the fence out of the main predicate loop, so the cost of the fence would only be paid once, and only paid when the predicate actually is true.

We also need to tag our atomics as volatile in C++. In fact, C++ compilers probably should notice when a std::atomic type is accessed and treat that like access to a volatile. However, at present it isn't obvious that C++ compilers are implementing this policy.
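For what it's worth, volatile does compose with std::atomic (the member functions have volatile-qualified overloads), so the belt-and-suspenders declaration is straightforward; a minimal sketch:

```cpp
#include <atomic>
#include <cstdint>

// volatile std::atomic: every access is emitted, none optimized away.
volatile std::atomic<uint64_t>* guard_ptr;  // would point into the RDMA region

uint64_t read_guard() {
    return guard_ptr->load(std::memory_order_relaxed);  // never elided
}
```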

Ken Birman
  • Current C++ compilers don't optimize atomics at all, they *do* treat them like `volatile atomic`. That could possibly change in the future, once the standards committee figures out a way to give useful guarantees that compilers can't sink all the relaxed atomic stores out of a progress-bar update loop or something, and without breaking existing code that doesn't use `volatile atomic`. See [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) for more, and links to WG proposals about when it would be useful for compilers to optimize. – Peter Cordes May 19 '22 at 17:40
  • Thanks Peter, useful to know! I had heard something along these lines but without any source cited. – Ken Birman May 20 '22 at 18:05