`relaxed` is sufficient. Every `counter.fetch_add(1, relaxed)` will return a different value, and every value from 0 upwards will be returned once. Atomicity alone guarantees this, since all operations are on the same atomic object.
There's no guarantee about which thread will get which `i` value, and the operations on `data[i]` aren't ordered, but that's fine because exactly one thread will access each `data[i]`, until `thread.join()` creates synchronization between the writer and reader.
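For reference, a minimal sketch of the kind of loop being discussed (the names, the 1'000'000 element count, and the 2-thread count are illustrative assumptions, not taken from the question):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t i;
        // relaxed is enough: atomicity alone makes each claimed i unique
        while ((i = counter.fetch_add(1, std::memory_order_relaxed)) < size)
            data[i] *= 2;   // exactly one thread ever touches each data[i]
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto &th : pool)
        th.join();   // join() synchronizes-with each worker, so data is safe to read now
}
```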
The relevant wording in the C++ standard is the part saying an atomic RMW must see the "latest value", and that a consistent modification order exists for each atomic object separately. And that within a single thread, operations on the same atomic object respect the sequenced-before program order (i.e. one `fetch_add` can't "reorder" with another `fetch_add`).
In hardware, the relevant behaviours are atomicity of the RMW, and cache coherency, which guarantees that all other copies of the cache line must be invalidated before an atomic RMW can do its load+add+store. (That's what makes atomic RMWs on the same address serializable.)
**If not relying on `.join()` for synchronization with the reader**
Otherwise you'd need a separate `atomic<bool> done` flag, or `release` RMWs if some other thread was just looking at `counter`. The waiting thread would have to wait for `counter == size + nthreads` to be sure that every thread has done a `release` operation sequenced-after its last `data[i] *= 2`. (And these form a release-sequence, so the reader would synchronize-with all the threads.)
Each thread will stop doing increments after it sees a false `fetch_add() < size`. The first thread (in modification order of `counter`) to see that happen will have loaded `i=size` and stored `size+1`. The second thread to leave the loop will have loaded `size+1` and stored `size+2`. So with `nthreads` = 1 or 2, `counter == size+nthreads` after they've done their final store as part of the RMW (and the same pattern continues for more threads).
So a load that sees `counter == size+nthreads` can be sure that all threads have done a `fetch_add` after their last `data[i] *= 2;`. If those were `release` fetch_adds and this was an `acquire` load, the stores to the `data[0..size-1]` objects are guaranteed to have happened-before this load, so later code in this thread can see the modified values.
(You can only check that all threads are finished; before that, it's not guaranteed that `data[0..counter-1]` are finished writing or anything like that. Even with `seq_cst` increments, you could have a thread claim an `i` value but stall or get scheduled out for a long time before accessing that `data[i]`. So any array element could still be being modified.)
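A hedged sketch of that release/acquire variant (again, the sizes, names, and the bare spin-wait are assumptions for illustration, not a definitive implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t i;
        // release RMWs: each thread's final fetch_add is sequenced after its
        // last data[i] *= 2, and all the RMWs on counter form a release sequence.
        while ((i = counter.fetch_add(1, std::memory_order_release)) < size)
            data[i] *= 2;
    };

    auto reader = [&] {
        // acquire load: once it sees size + nthreads, every worker's final
        // release fetch_add happened-before it, so all data[] writes are visible.
        while (counter.load(std::memory_order_acquire) != size + nthreads) {
            // spin-wait; a real program would at least pause / back off here
        }
        long long sum = 0;
        for (int x : data) sum += x;   // safe: all elements are fully written
        (void)sum;
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    std::thread r(reader);
    for (auto &th : pool) th.join();
    r.join();
}
```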
Could a compiler batch this for you into `fetch_add(128, relaxed)` ... and loop over the `i` values? Hypothetically yes, but practically no. See *Why don't compilers merge redundant std::atomic writes?* - compilers don't optimize atomics at all, except for reordering non-atomic operations across them when allowed.
Hypothetically, a compiler could even statically decide to compile it so it always runs with one thread doing all the increments and the other thread doing none (except the final one to exit the loop), e.g. by doing a `+= 1 mil` first and then looping over those `i` values. This isn't a correctness problem, and it would be pretty insane for a compiler to actually do. It's not something you need to rule out.
Even if you used `seq_cst`, a sufficiently-smart Deathstation 9000 compiler could analyze the whole program and see that nothing synchronizes-with the values of `counter`, so it could still make the same transformation. If anything observed the final value of `counter`, it would also have to make sure `counter == size + nthreads` at the end, so it couldn't just do `fetch_add(1'000'000)` in both threads.
**You want to manually use batches larger than 1**
A batch size like 4096 bytes or ints could make sense. Claiming 4096 elements to work on and looping over `i + 0..4095` lets that inner loop use SIMD, and keeps plenty of work in flight before the next slow atomic RMW that has to wait for the `counter` cache line to bounce between threads.
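A minimal sketch of that manual batching (the 4096-element batch size and the other numbers are illustrative assumptions):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr std::size_t batch = 4096;       // elements claimed per atomic RMW
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t start;
        // One atomic RMW per 4096 elements instead of one per element.
        while ((start = counter.fetch_add(batch, std::memory_order_relaxed)) < size) {
            std::size_t end = std::min(start + batch, size);  // last batch may be partial
            for (std::size_t i = start; i < end; ++i)         // plain inner loop: can auto-vectorize
                data[i] *= 2;
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto &th : pool) th.join();
}
```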
Having only one thread access a cache line containing a range of `data[]` values avoids bouncing those cache lines around. So one 64-byte cache line is the minimum batch size you might want to use if `data[]` is aligned; otherwise you need larger batches so splits only happen at the ends of batches. But you want larger batches anyway, since 64 bytes is only a couple or a few SIMD loads/stores.
If `data[]` is page-aligned, then page-sized blocks mean only one thread needs a dTLB miss for that page, and hardware prefetching on modern x86 CPUs stops at page boundaries (since contiguous virtual pages might not be physically contiguous). So a page boundary is a natural place to end a batch, with HW prefetch not creating contention for the thread working on the next chunk.
Especially on x86, where every atomic RMW is effectively `seq_cst` and a full memory barrier in asm, you want to make your batch size plenty large to amortize those costs: as large as possible without leaving a lot of leftover work at the end if one thread stalls or something.