`relaxed` is sufficient. Every `counter.fetch_add(1, relaxed)` will return a different value, and every value from 0 upwards will be returned once. Atomicity alone guarantees this, since all operations are on the same atomic object.
There's no guarantee about which thread will get which `i` value, and the operations on `data[i]` aren't ordered, but that's fine because exactly one thread will access each `data[i]`, until `thread.join()` creates synchronization between the writer and reader.
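For reference, a minimal sketch of the kind of loop being discussed (the names, the 1'000'000 element count, and the 2-thread count are illustrative assumptions, not taken from the question):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t i;
        // relaxed is enough: atomicity alone makes each claimed i unique
        while ((i = counter.fetch_add(1, std::memory_order_relaxed)) < size)
            data[i] *= 2;   // exactly one thread ever touches each data[i]
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto &th : pool)
        th.join();   // join() synchronizes-with each worker, so data is safe to read now
}
```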
The relevant wording in the C++ standard is the part saying an atomic RMW must see the "latest value", and that a consistent modification order exists for each atomic object separately. And that within a single thread, operations on the same atomic object respect the sequenced-before program order (i.e. one `fetch_add` can't "reorder" with another `fetch_add`).
In hardware, the relevant behaviours are atomicity of the RMW, and cache coherency, which guarantees that all other copies of the cache line must be invalidated before an atomic RMW can do its load+add+store. (That's what makes atomic RMWs on the same address serializable.)
**If not relying on `.join()` for synchronization with the reader**
Otherwise you'd need a separate `atomic<bool> done` flag, or `release` RMWs if some other thread was just looking at `counter`. The waiting thread would have to wait for `counter == size + nthreads` to be sure that every thread has done a `release` operation sequenced-after its last `data[i] *= 2`. (And these form a release-sequence, so the reader would synchronize-with all the threads.)
Each thread will stop doing increments after it sees a false `fetch_add() < size`. The first thread (in modification order of `counter`) to see that happen will have loaded `i=size` and stored `size+1`. The second thread to leave the loop will have loaded `size+1` and stored `size+2`. So with `nthreads` = 1 or 2, `counter == size+nthreads` after they've done their final store as part of the RMW (and the same pattern continues for more threads).
So a load that sees `counter == size+nthreads` can be sure that all threads have done a `fetch_add` after their last `data[i] *= 2;`. If those were `release` fetch_adds and this was an `acquire` load, the stores to the `data[0..size-1]` objects are guaranteed to have happened-before this load, so later code in this thread can see the modified values.
(You can only check that all threads are finished; before that, it's not guaranteed that `data[0..counter-1]` are finished writing or anything like that. Even with `seq_cst` increments, you could have a thread claim an `i` value but stall or get scheduled out for a long time before accessing that `data[i]`. So any array element could still be being modified.)
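A hedged sketch of that release/acquire variant (again, the sizes, names, and the bare spin-wait are assumptions for illustration, not a definitive implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t i;
        // release RMWs: each thread's final fetch_add is sequenced after its
        // last data[i] *= 2, and all the RMWs on counter form a release sequence.
        while ((i = counter.fetch_add(1, std::memory_order_release)) < size)
            data[i] *= 2;
    };

    auto reader = [&] {
        // acquire load: once it sees size + nthreads, every worker's final
        // release fetch_add happened-before it, so all data[] writes are visible.
        while (counter.load(std::memory_order_acquire) != size + nthreads) {
            // spin-wait; a real program would at least pause / back off here
        }
        long long sum = 0;
        for (int x : data) sum += x;   // safe: all elements are fully written
        (void)sum;
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    std::thread r(reader);
    for (auto &th : pool) th.join();
    r.join();
}
```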
Could a compiler batch this for you into `fetch_add(128, relaxed)` ... and loop over the `i` values? Hypothetically yes, but practically no. See *Why don't compilers merge redundant std::atomic writes?* - compilers don't optimize atomics at all, except for reordering non-atomic operations across them when allowed.
Hypothetically, a compiler could even statically decide to compile it so it always runs with one thread doing all the increments and the other thread doing none (except the final one to exit the loop), e.g. by doing a `+= 1 mil` first and then looping over those `i` values. This isn't a correctness problem, and it would be pretty insane for a compiler to actually do. It's not something you need to rule out.
Even if you used `seq_cst`, a sufficiently-smart Deathstation 9000 compiler could analyze the whole program and see that nothing synchronizes-with the values of `counter`, so it could still make the same transformation. If anything observed the final value of `counter`, it would also have to make sure `counter == size + nthreads` at the end, so it couldn't just do `fetch_add(1'000'000)` in both threads.
**You want to manually use batches larger than 1**
A batch size like 4096 bytes or ints could make sense. Claiming 4096 elements to work on and looping over `i + 0..4095` lets that inner loop use SIMD, and keeps plenty of work in flight before the next slow atomic RMW that has to wait for the `counter` cache line to bounce between threads.
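A minimal sketch of that manual batching (the 4096-element batch size and the other numbers are illustrative assumptions):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t size = 1'000'000;   // assumption for illustration
    constexpr std::size_t batch = 4096;       // elements claimed per atomic RMW
    constexpr int nthreads = 2;               // assumption for illustration

    std::vector<int> data(size, 1);
    std::atomic<std::size_t> counter{0};

    auto worker = [&] {
        std::size_t start;
        // One atomic RMW per 4096 elements instead of one per element.
        while ((start = counter.fetch_add(batch, std::memory_order_relaxed)) < size) {
            std::size_t end = std::min(start + batch, size);  // last batch may be partial
            for (std::size_t i = start; i < end; ++i)         // plain inner loop: can auto-vectorize
                data[i] *= 2;
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto &th : pool) th.join();
}
```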
Having only one thread access a cache line containing a range of `data[]` values avoids bouncing those cache lines around. So one 64-byte cache line is the minimum batch size you might want to use if `data[]` is aligned; otherwise you need larger batches so splits only happen at the ends of batches. But you want larger batches anyway, since 64 bytes is only a couple or a few SIMD loads/stores.
If `data[]` is page-aligned, then page-sized blocks mean only one thread needs a dTLB miss for that page, and hardware prefetching on modern x86 CPUs stops at page boundaries (since contiguous virtual pages might not be physically contiguous). So a page boundary is a natural place to end a batch, with HW prefetch not creating contention for the thread working on the next chunk.
Especially on x86, where every atomic RMW is effectively `seq_cst` and a full memory barrier in asm, you want to make your batch size plenty large to amortize those costs: as large as possible without leaving a lot of leftover work at the end if one thread stalls or something.