
Suppose I wanted to implement a mechanism for calling a piece of code exactly once (e.g. for initialization purposes), even when multiple threads hit the call site repeatedly. Basically, I'm trying to implement something like pthread_once, but with GCC atomics and spin-locking. I have a candidate implementation below, but I'd like to know if

a) it could be faster in the common case (i.e. already initialized), and,

b) is the selected memory ordering strong enough / too strong?

Architectures of interest are x86_64 (primarily) and aarch64.

The intended use API is something like this

void gets_called_many_times_from_many_threads(void)
{  
    static int my_once_flag = 0;
     
    if (once_enter(&my_once_flag)) {
        // do one-time initialization here
        once_commit(&my_once_flag);
    }

    // do other things that assume the initialization has taken place
}

And here is the implementation:

int once_enter(int *b)
{
    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
        // on x86, insert a pause instruction here
    }
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}

I think the RELAXED ordering on the compare-exchange is okay, because even when the compare-exchange reads 2 (into the "zero" variable), we don't skip the atomic load in the while condition, so the ACQUIRE on that load synchronizes with the RELEASE in once_commit (I think). But maybe a successful compare-exchange needs to use RELEASE? I'm unclear here.

Also, I just learned that lock cmpxchg is a full memory barrier on x86, and since we hit the __atomic_compare_exchange_n even in the common case (initialization has already been done), that barrier occurs on every function call. Is there an easy way to avoid this?


UPDATE

Based on the comments and accepted answer, I've come up with the following modified implementation. If anybody spots a bug, please let me know, but I believe it's correct. Basically, the change amounts to implementing double-checked locking. I also switched to using SEQ_CST because:

  • I mainly care that the common (already initialized) case is fast.
  • I observed that GCC doesn't emit a memory fence instruction on x86 for the first read (and it does do so on ARM even with ACQUIRE).

#ifdef __x86_64__
#define PAUSE()  __asm__ __volatile__("pause")
#else
#define PAUSE()
#endif

int once_enter(int *b)
{
    if(2 == __atomic_load_n(b, __ATOMIC_SEQ_CST)) return 0;

    int zero = 0;
    int got_lock = __atomic_compare_exchange_n(b, &zero, 1, 0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    if (got_lock) return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_SEQ_CST)) {
        PAUSE();
    }
    return 0;
}

void once_commit(int *b)
{
    (void) __atomic_store_n(b, 2, __ATOMIC_SEQ_CST);
}
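For comparison, here is a sketch of the same double-checked structure using the weaker orderings from my original version (ACQUIRE on the loads, synchronizing with the RELEASE in once_commit), which Peter Cordes suggests is cheaper on AArch64. The `_acq`/`_rel` names are mine, purely to distinguish this variant; treat it as a sketch, not a verified drop-in:

```c
#include <assert.h>
#include <pthread.h>   /* only needed for a multi-threaded check */

#ifdef __x86_64__
#define PAUSE()  __asm__ __volatile__("pause")
#else
#define PAUSE()
#endif

/* Fast path: an ACQUIRE load that sees 2 synchronizes with the
 * RELEASE store in once_commit_rel, so the initialization is visible. */
static int once_enter_acq(int *b)
{
    if (2 == __atomic_load_n(b, __ATOMIC_ACQUIRE)) return 0;

    int zero = 0;
    /* ACQUIRE on both success and failure: the winner doesn't need
     * RELEASE here (once_commit_rel publishes the initialization). */
    if (__atomic_compare_exchange_n(b, &zero, 1, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
        return 1;

    while (2 != __atomic_load_n(b, __ATOMIC_ACQUIRE)) {
        PAUSE();
    }
    return 0;
}

static void once_commit_rel(int *b)
{
    __atomic_store_n(b, 2, __ATOMIC_RELEASE);
}
```

On x86 both variants compile the fast path to a plain load, but with ACQUIRE the AArch64 fast path can use `ldapr` (or `ldar`) rather than the fence a SEQ_CST load may require elsewhere.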
Harris M Snyder
    Generally an atomic RMW is still the most efficient way to get mutual exclusion, even though it costs a full memory barrier on x86. Alternatives include Dekker's algorithm, but that's usually even slower and actually requires a full memory barrier (https://en.wikipedia.org/wiki/Dekker%27s_algorithm). – Peter Cordes Dec 16 '22 at 03:00
  • 1
    Oh, but yeah in this case, like a guard variable for a static initializer, you only need mutual exclusion after finding out that the call_once hasn't happened yet. The fast-path should be a read-only check. My previous comment was mostly just replying to the sentence about lock cmpxchg being a full barrier; you definitely want a read-only check even on ISAs that can do relaxed or acquire RMWs in asm, unless they bail out before actually invalidating the cache line on other cores. – Peter Cordes Dec 16 '22 at 09:32
  • 2
    Is there a particular reason you're using GCC builtin atomics instead of the equally capable, and much more portable, C11 atomics from ``? – Nate Eldredge Dec 16 '22 at 21:22
  • 2
    For AArch64, you might want to switch to a different primitive than `compare_exchange`. ARMv8.1 has a suite of single-instruction atomic read-modify-write operations, including set/clear/toggle bits, and integer add, min and max. But not compare-exchange, so that has to fall back to a `ldxr/stxr` loop as on ARMv8.0, whose forward progress is less certain. – Nate Eldredge Dec 16 '22 at 21:31
  • 2
    One possible alternative would be to have a pair of boolean flags, one to indicate whether initialization has started, and the other to indicate whether it has finished. So then if the `finished` flag is unset, you atomically test-and-set the `started` flag, and continue with the initialization if you see that it was previously false; otherwise you spin until `finished` becomes true. Atomic test-and-set is a single instruction on ARMv8.1 (`LDSETB` and its acquire/release variants). For x86 and ARMv8.0, test-and-set is about the same as compare-exchange, performance-wise. – Nate Eldredge Dec 16 '22 at 21:34
  • 2
    Not that it makes too much difference here, since once you are using a double-checked lock, the atomic read-modify-write is out of the fast path anyway. But maybe worth noting in general. – Nate Eldredge Dec 16 '22 at 21:35
  • You do want `acquire` not `seq_cst` for the fast-path load. That's cheaper on AArch64 (`ldapr` instead of `ldar`) and PowerPC, among others. – Peter Cordes Dec 17 '22 at 20:39
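Nate Eldredge's two-flag scheme could be sketched like this. The `once2` struct and function names are hypothetical, and `__atomic_test_and_set` maps to a single atomic RMW instruction on ARMv8.1 with LSE, avoiding the `ldxr/stxr` retry loop of compare-exchange:

```c
#include <assert.h>
#include <stdbool.h>
#include <pthread.h>   /* only needed for a multi-threaded check */

/* Hypothetical two-flag once-control: `started` is test-and-set by
 * whichever thread wins the race to initialize; `finished` is what
 * everyone fast-paths through or spins on. */
struct once2 {
    bool started;
    bool finished;
};

static int once2_enter(struct once2 *o)
{
    /* Fast path: initialization already finished. */
    if (__atomic_load_n(&o->finished, __ATOMIC_ACQUIRE)) return 0;

    /* test_and_set returns the previous value: false means we won
     * the race and should perform the initialization. */
    if (!__atomic_test_and_set(&o->started, __ATOMIC_ACQUIRE))
        return 1;

    /* Someone else is initializing; wait for them to finish. */
    while (!__atomic_load_n(&o->finished, __ATOMIC_ACQUIRE))
        ; /* insert PAUSE() on x86 */
    return 0;
}

static void once2_commit(struct once2 *o)
{
    __atomic_store_n(&o->finished, true, __ATOMIC_RELEASE);
}
```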

1 Answer


a, What you need is a double-checked lock.

Basically, instead of entering the lock every time, you do an acquiring-load to see if the initialisation has been done yet, and only invoke once_enter if it has not.

void gets_called_many_times_from_many_threads(void)
{  
    static int my_once_flag = 0;

    if (__atomic_load_n(&my_once_flag, __ATOMIC_ACQUIRE) != 2) {
        if (once_enter(&my_once_flag)) {
            // do one-time initialization here
            once_commit(&my_once_flag);
        }
    }

    // do other things that assume the initialization has taken place
}

b, I believe this is enough: your initialisation happens before the releasing store of 2 to my_once_flag, and every other thread has to observe the value 2 with an acquiring load from the same variable.
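As an aside, the same pattern translates almost mechanically to portable C11 `<stdatomic.h>`, as suggested in the comments. A sketch, with the `_c11` suffixes being my own naming:

```c
#include <assert.h>
#include <stdatomic.h>
#include <pthread.h>   /* only needed for a multi-threaded check */

static int once_enter_c11(atomic_int *b)
{
    /* Fast path: already initialised. */
    if (atomic_load_explicit(b, memory_order_acquire) == 2) return 0;

    int zero = 0;
    if (atomic_compare_exchange_strong_explicit(
            b, &zero, 1, memory_order_acquire, memory_order_acquire))
        return 1;

    /* Spin until the initialiser commits. */
    while (atomic_load_explicit(b, memory_order_acquire) != 2)
        ;
    return 0;
}

static void once_commit_c11(atomic_int *b)
{
    atomic_store_explicit(b, 2, memory_order_release);
}
```

On GCC and Clang these compile to the same instructions as the `__atomic_*` builtins, but the code no longer depends on a compiler extension.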

Quân Anh Mai