GCC's __sync builtins are obsoleted by its __atomic builtins.
Both reader and writer should be using __atomic operations, like __atomic_load_n and __atomic_store_n. (You don't need or want an expensive atomic RMW since there's only one writer.) With __ATOMIC_RELAXED, load and store are as cheap as plain operations, except that they can't get optimized away into registers. Normal accesses (not via __atomic builtins) to a plain non-volatile, non-_Atomic variable are never ok with concurrent reads + writes, and will break in practice, e.g. by hoisting a load out of a spin-wait loop.
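For example, a single-writer counter using those builtins might look like this (a sketch; the function names are mine):

```c
#include <stdint.h>

static uint64_t counter;  // accessed only via __atomic builtins

uint64_t read_counter(void) {
    return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}

void increment_counter(void) {  // call from the single writer thread only
    uint64_t tmp = __atomic_load_n(&counter, __ATOMIC_RELAXED);
    // a plain store, not an RMW: safe because nobody else ever writes counter
    __atomic_store_n(&counter, tmp + 1, __ATOMIC_RELAXED);
}
```

With __ATOMIC_RELAXED, both functions compile to ordinary load/store instructions on mainstream ISAs; the builtins just stop the compiler from caching the value in a register across calls.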
But in new code, you should normally use C11 stdatomic.h with _Atomic types (https://en.cppreference.com/w/c/thread), and the macros available for testing whether a given atomic type is lock-free on the target you're compiling for.
(With the old __sync builtins, I think the design intent was that you'd use volatile for pure-load and pure-store, since there were no __sync builtins for those. Hand-rolling your own atomics with volatile works on normal ISAs with normal compilers like GCC and clang, and the Linux kernel depends on it, but it's not recommended when you can get the job done with C11 stdatomic.h.)
Lock-free 64-bit atomics
On systems where ATOMIC_LLONG_LOCK_FREE > 0, use relaxed (or acquire/release) atomics, depending on what ordering guarantees you need (e.g. if the writer is using the counter to "publish" other data to readers, such as when the counter is an index into a non-atomic array).
#include <stdatomic.h>
#include <stdint.h>

#if ATOMIC_LLONG_LOCK_FREE > 0

static atomic_uint_fast64_t counter;

// make sure this can inline into readers
uint64_t read_counter(void) {
    return atomic_load_explicit(&counter, memory_order_relaxed);  // or memory_order_acquire
}

void increment_counter_single_writer(void) {  // call from one thread only
    uint64_t tmp = atomic_load_explicit(&counter, memory_order_relaxed);
    // or keep a local copy of the counter in a register and *just* store:
    // other threads just see the values we store, not how we got them
    atomic_store_explicit(&counter, tmp + 1, memory_order_relaxed);  // or memory_order_release
}

// with multiple writers, use atomic_fetch_add_explicit

#else

static uint64_t counter;
static _Atomic unsigned seq;   // for a SeqLock fallback
uint64_t read_counter(void) { ... }

#endif
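The "keep a local copy in a register and *just* store" variant mentioned in the comment could look like this (a self-contained sketch; writer_count is my name for the writer's private copy):

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t counter;   // shared with readers

// With exactly one writer, nobody else ever modifies counter, so the
// writer's private count is always up to date and it never has to
// reload the shared variable at all.
static uint64_t writer_count;          // touched by the single writer only

void increment_counter_single_writer(void) {
    writer_count++;
    atomic_store_explicit(&counter, writer_count, memory_order_relaxed);
}

uint64_t read_counter(void) {
    return atomic_load_explicit(&counter, memory_order_relaxed);
}
```

This saves the writer a load per increment, and makes it obvious that no RMW is involved.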
Note that some 32-bit systems can do lock-free 64-bit atomics, such as x86 (since P5 Pentium) and some ARM32. See how this compiles on Godbolt with clang for x86 and ARM Cortex-A8 (to pick a random ARM that's not recent).
Otherwise probably a SeqLock
Otherwise, without lock-free 64-bit atomics, use a SeqLock if the counter doesn't increment too often. (See Implementing 64 bit atomic counter with 32 bit atomics). This still lets the readers be truly read-only so they don't contend with each other for cache lines, they only have to retry if they tried to read while the writer was in the middle of an update. (After the writer is done, the cache line containing the counter can be in Shared state on all cores running reader threads, so they can all get hits in cache.)
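A minimal C11 SeqLock sketch for that fallback (names and structure are my own; single writer assumed): the writer makes the sequence number odd while it updates the halves, and readers retry if they saw an odd or changed sequence.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic unsigned seq;   // even = stable, odd = write in progress
static atomic_uint lo, hi;     // 32-bit halves; relaxed, seq provides ordering

void increment_counter(void) {           // single writer only
    static uint64_t count;               // writer's private copy
    count++;
    unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);  // now odd
    atomic_thread_fence(memory_order_release);  // seq store before payload stores
    atomic_store_explicit(&lo, (uint32_t)count, memory_order_relaxed);
    atomic_store_explicit(&hi, (uint32_t)(count >> 32), memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);  // even again, publish
}

uint64_t read_counter(void) {
    unsigned s0, s1;
    uint32_t l, h;
    do {
        s0 = atomic_load_explicit(&seq, memory_order_acquire);
        l  = atomic_load_explicit(&lo, memory_order_relaxed);
        h  = atomic_load_explicit(&hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  // payload loads before re-check
        s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);     // retry if torn or mid-write
    return ((uint64_t)h << 32) | l;
}
```

Making the payload relaxed atomics (rather than plain uint32_t) avoids data-race UB in the C11 memory model while still compiling to plain loads/stores.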
For a monotonic counter, the halves of the counter itself can work as a sequence number to detect tearing: e.g. read the low half before and after reading the high half, and retry if they differ.
A readers/writers lock would force readers to contend with each other to modify the cache line holding the lock, so the total throughput doesn't scale with number of readers. If you have a very-frequently modified counter (so a seqlock would often be in an inconsistent state), you might consider something more clever, like a queue of recent values so readers could check the most recent consistent value or something?
BTW, ATOMIC_LLONG_LOCK_FREE > 0 seems appropriate: ATOMIC_LLONG_LOCK_FREE == 1 means "sometimes lock-free", but in practice there aren't implementations where some objects are lock-free and some aren't. And if there were, we'd hope a compiler could arrange for a lone global/static variable to be aligned such that it's atomic.
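If you did want to handle the "sometimes lock-free" case at run time, a helper like this could combine the compile-time macro with the per-object check (a sketch; the helper name is mine):

```c
#include <stdatomic.h>
#include <stdbool.h>

// True if long long atomics are guaranteed lock-free at compile time
// (ATOMIC_LLONG_LOCK_FREE == 2, per C11 7.17.1), or if this particular
// object happens to be lock-free at run time.
static bool llong_counter_is_lock_free(_Atomic long long *p) {
    return ATOMIC_LLONG_LOCK_FREE == 2 || atomic_is_lock_free(p);
}
```

On mainstream 64-bit targets this is always true, which is why testing the macro against > 0 is fine in practice.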