
I'm making a profiler for an interpreter: I need the interpreter to write the current frame position somewhere on every call, and then to sample that information every X ms. I initially started with rigtorp's spinlock around the frame position, but that had quite an effect on the runtime of the interpreter (profiling pointed at the lock acquire time on every pass through the interpreter loop). After reading quite a few pages about memory fences, I came up with a more efficient solution, but I would like to know if this is a correct interpretation of the relationship between memory_order_relaxed and acquire/release.
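
For reference, the spinlock variant looked roughly like this (a minimal sketch in the style of rigtorp's implementation, https://rigtorp.se/spinlock/; not my exact code):

#include <atomic>
#include <immintrin.h>

// Test-and-set spinlock guarding the shared frame. Taking this lock on
// every interpreter dispatch is what profiling flagged as the hot spot.
struct spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // optimistically try to grab the lock
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // spin read-only until the lock looks free, to avoid
            // bouncing the cache line around with read-modify-writes
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};

The more efficient solution I came up with: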

#include <atomic>
#include <chrono>
#include <cstdint>     // uint8_t, uint16_t, uint32_t
#include <cstdlib>     // exit
#include <cstring>     // memset
#include <immintrin.h> // _mm_pause
#include <iostream>
#include <thread>

using namespace std;
 
struct frame {
    uint8_t op;
    uint16_t arg;
    uint32_t check; // op + arg; lets the reader detect a torn snapshot
};


static constexpr unsigned int FRAME_BUFFER_SIZE = 8 * 1024;

static atomic<unsigned int> index;
static frame frames[FRAME_BUFFER_SIZE];

static void writer() {
    uint8_t op = 0;
    uint16_t arg = 0;
    for (;;) {
        op++;
        arg++;
        const auto newIndex = index.load(memory_order_relaxed) + 1;
        auto &target = frames[newIndex % FRAME_BUFFER_SIZE];
        target.op = op;
        target.arg = arg;
        target.check = static_cast<uint32_t>(arg) + op;
        index.store(newIndex, memory_order_release);
        _mm_pause(); // give the other threads some time
    }
}

static void reader() {
    for (;;) {
        const auto lastValidIndex = index.load(memory_order_acquire);
        // we race, hoping that FRAME_BUFFER_SIZE gives enough room
        // to avoid the writer catching up to us
        const auto snapshot = frames[lastValidIndex % FRAME_BUFFER_SIZE];
        if ((static_cast<uint32_t>(snapshot.arg) + snapshot.op) != snapshot.check) {
            cout << "Invalid snapshot\n";
            exit(1);
        }
        // we sleep a bit, since the reader is only intended to read once in a while
        this_thread::sleep_for(chrono::milliseconds(1)); 

    }
}

int main() {
    cout << "Starting race\n";
    index = 0;
    memset(frames, 0, sizeof(frames));
    thread w(writer);
    thread r(reader);
    w.join();
    r.join();
    return 0;
}

So the strategy is as follows: we have a circular buffer that only the writer writes into, and the writer is also the only thread that mutates the index variable, so its first read of index can be memory_order_relaxed. The writer then fills in the frame slot and finally stores the new "full" index, this time with memory_order_release. The reader only loads the index (in this case with memory_order_acquire) and then reads the frame at that location in the array.

So my questions are:

  • does this pairing of fences guarantee that the writes to frames array happen before the update of the index?
  • is the cpu "cache" of the fences cleared in the read thread every time we do a memory_order_acquire?
  • is it indeed safe to do a memory_order_relaxed read of the index variable in the writer, since we know that thread is the only one mutating it, and at that point we don't care about the values in the frames array?
Davy Landman
  • Your code doesn't have any fences (`atomic_thread_fence` or `atomic_signal_fence`). It has various *operations* with some ordering (not relaxed), but operations and fences are different things. https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/ explains how fences (2-way ordering) are different from operations (1-way ordering); see https://preshing.com/20120913/acquire-and-release-semantics/ for acq/rel. – Peter Cordes Nov 18 '22 at 12:29
  • Re: your middle point, no CPU cache doesn't have to get flushed, [it's coherent](https://stackoverflow.com/a/58535118/224132). You're just controlling the order of visibility of this thread's accesses to it. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ – Peter Cordes Nov 18 '22 at 12:35
  • And "cache of the fences" makes no sense at all. Perhaps you mean caching of the atomic variable? The *compiler* will do a fresh load, not use an old value it kept in a register (registers are thread-private; when people talk about atomics avoiding "caching" values, anything about CPU caches is a distortion or misunderstanding of this concept.) – Peter Cordes Nov 18 '22 at 12:35
  • As for the final point, I haven't looked at your code in a lot of detail yet. But yes, if you're the only writer of a variable, then it's always fine to do relaxed loads of it, and separately store a new value, with either relaxed, or `release` if other readers will want to also see other values in other variables upon seeing a certain value in the atomic var. e.g. use the load result to index an array. – Peter Cordes Nov 18 '22 at 12:37
  • @PeterCordes thank you for your detailed replies. Sorry for messing up the terminology. I think I wrongfully applied the mental model proposed [in davmac's article](https://davmac.wordpress.com/2018/04/03/understanding-the-c-c-memory-model-part-2/). So every core gets its own read and write buffer (which acts a bit like a cache), and as I understood it, you have to make sure the cpu knows to clear those. But I'm getting the feeling I've messed up that whole point in this implementation. – Davy Landman Nov 18 '22 at 12:45
  • So I want to try and share data between two threads, where the writer is continuously offering, while the reader is only interested once a millisecond, so I'm trying to find a way to make the writer not take too hard of a hit. – Davy Landman Nov 18 '22 at 12:46
  • Yeah, CPUs have a [store buffer](https://stackoverflow.com/q/64141366/224132) to insulate actual execution from commit to globally-visible cache. Load buffers may be a useful mental model to abstract the actual reasons for load reordering (out-of-order exec, or in-order with hit-under-miss caches) but it's less of a real thing; a load execution unit does directly read from L1d cache, because execution has to wait for load values, but a single core doesn't really care how long it takes stores to get anywhere, as long as they happen and it can see its own stores in program-order. – Peter Cordes Nov 18 '22 at 12:49
  • For your actual goal, yeah, writing a buffer with some way to consistency-check it sounds reasonable. Any data race is UB in C++ (simultaneous access on a non-atomic variable when at least one access is a write), but things like a seqlock rely on writing and then checking for tearing. At least one of the members of the struct should probably be `atomic<>`, though, like a sequence number. – Peter Cordes Nov 18 '22 at 13:16
  • In fact, each array entry can just *be* a seqlock, or only write the sequence number once (as a `release` operation after writing everything else), but it has to match the current index. (You have the reader do the modulo, so it has the upper bits of the index to check against the sequence number in the bucket it reads.) If that bucket is torn, look at the previous one. Or just always look at the previous index unless there's some need to have the reader see a very recent sample? I guess if some interpreter steps are slower than others, you want to catch those, not the one before? – Peter Cordes Nov 18 '22 at 13:20
  • [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) - So like `struct {atomic<uint32_t> seq; uint16; uint8;};` (Generally sort largest first to minimize padding for alignment, although actually you'd get 1 byte either way, just at the end instead of between u8 and u16.) `unsigned idx` is a local var, you don't need to keep reloading `index`, although it's fine especially if a compiler wouldn't have kept it a local in a register in a real-world version of this tiny example. – Peter Cordes Nov 18 '22 at 13:30
  • `++idx;` / `arr[idx % size].seq.store(idx, relaxed);` (make sure the new sequence number is visible before later stores) `atomic_thread_fence(release)` / non-atomic `arr[idx % size].fields = stuff;` / `index.store(idx, release);`. So like a seqlock, except one store is to the bucket, one store is to `index`. Actually you'd be fine with probably even an 8-bit sequence number if you set it from the upper bits of the index, the ones that aren't redundant with the actual array index; unless 256 wrap-arounds before another read is plausible. – Peter Cordes Nov 18 '22 at 13:32 (this sequence is sketched in code after the comments)
  • (That `atomic_thread_fence(release)` to order an atomic store before some non-atomic assignments is not strictly safe or guaranteed without the payload actually being atomic with `relaxed`, but works in practice.) – Peter Cordes Nov 18 '22 at 13:34
  • With an 8-bit sequence number, you could just use `atomic<>` of the whole struct. It's only 32-bit so the whole struct can be a lock-free atomic. Or with no sequence number at all, since tearing would be impossible; anything you read pointed-to by `index` will be correct. So you might just have the writer update the same location repeatedly with no index, with `relaxed`, since the reader can see everything it needs in one load. Unless that's not true, then yeah acq/rel. – Peter Cordes Nov 18 '22 at 13:37
  • If you do use an array, it should be tiny, like maybe 4 to 16 entries. Maybe spread over a couple of cache lines so the reader disturbs the writer slightly less, but if it's as infrequent as every 1 ms, atomically writing 4 or 8 bytes to a single object with no other tracking is very very good, cheap for the writer as it stays hot in cache and doesn't pollute much space (unlike looping over a much larger buffer, so only do that if that history of operations is sometimes useful to grab a history trace.) – Peter Cordes Nov 18 '22 at 15:11
  • Even 16-byte structs can be written atomically cheaply on Intel CPUs with AVX, but compilers probably haven't caught up with that recently-documented guarantee. – Peter Cordes Nov 18 '22 at 15:11
  • wow @PeterCordes this is great material. I'm almost tempted to rewrite this question, such that you can answer it with your design and get the much deserved credits. Also, in reality I need to store around 18 bytes, so sadly not small enough to fit in an atomic, but I'm going to try your suggestion with the per-entry seqlock (and maybe benchmark just using a single seqlock) as I'm only interested in the latest point (and indeed, sometimes if the evaluator is slow, I prefer reading that point multiple times). – Davy Landman Nov 18 '22 at 19:15
  • You could ask a separate question that's about your actual problem, not the conceptual question you're asking here. If writes happen very frequently, a single SeqLock is maybe not ideal; the reader might have to retry a few times if the value is frequently in the middle of being updated. But I guess "frequent" would have to be measured against how long it takes to commit a few stores to a cache line, not how frequently the reader actually reads, so it's likely fine. And if you're not interested in sometimes taking the 2nd-newest, probably no benefit to a ring buffer. – Peter Cordes Nov 19 '22 at 02:20
  • hi @PeterCordes check this new question: https://stackoverflow.com/questions/74524246/how-to-reduce-the-writer-side-performance-penalty-for-sharing-a-small-structur/74524285#74524285 – Davy Landman Nov 21 '22 at 20:00
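
To make the design from these comments concrete, here is a minimal sketch of the per-entry sequence-number scheme; the names (slot, head, publish, try_snapshot) are mine, not from the comments, and, per the caveat above, the non-atomic payload stores are formally a data race that this scheme tolerates by detecting tearing through the sequence number:

#include <atomic>
#include <cstdint>

static constexpr unsigned SIZE = 16; // tiny buffer, as suggested above

struct slot {
    std::atomic<std::uint32_t> seq{0}; // full index value doubles as sequence number
    std::uint16_t arg;
    std::uint8_t op;
};

static slot arr[SIZE];
static std::atomic<std::uint32_t> head{0}; // published index; the writer is its only mutator

// Writer: claim the slot with the new sequence number, then fill the
// payload, then publish the index. The release fence keeps the seq
// store from reordering after the payload stores; the release store
// to head makes the payload visible before the new index.
static void publish(std::uint32_t idx, std::uint8_t op, std::uint16_t arg) {
    slot &s = arr[idx % SIZE];
    s.seq.store(idx, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    s.op = op;   // non-atomic payload stores: formally UB under a race,
    s.arg = arg; // but tearing is caught by the seq check in the reader
    head.store(idx, std::memory_order_release);
}

// Reader: load the published index, copy the payload, then verify the
// slot's sequence number still matches. If the writer lapped us, seq
// already holds a newer index and the snapshot is rejected; the caller
// can retry or fall back to the previous slot.
static bool try_snapshot(std::uint8_t &op, std::uint16_t &arg) {
    const std::uint32_t idx = head.load(std::memory_order_acquire);
    const slot &s = arr[idx % SIZE];
    op = s.op;
    arg = s.arg;
    std::atomic_thread_fence(std::memory_order_acquire); // payload reads before the seq check
    return s.seq.load(std::memory_order_relaxed) == idx;
}

The single-slot alternative also mentioned above (an atomic<> of the whole 32-bit struct, or even no sequence number at all) avoids the array entirely when only the most recent sample matters.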
