
I have a system with 3 CPUs sharing memory and a bus. There are no caches in the system; there are, however, store buffers. The CPUs have an atomic Compare-And-Swap (CAS) instruction available. Additionally, there is a LoadMemoryBarrier() instruction that ensures loads cannot be reordered across the barrier, a StoreMemoryBarrier() that ensures no stores can be reordered across the barrier, and a GeneralMemoryBarrier() that ensures no memory access can be reordered across the barrier.

The CPUs can reorder stores against stores or loads, loads against stores or loads, and either against atomic operations. Basically, they can reorder memory accesses any way they please.

unsigned int spinlock = 0;
int x = 0;
void lock()
{
    while (CAS(&spinlock, 1, 0) == 1); // Wait to acquire lock
    // Spinlock acquired
    LoadMemoryBarrier();
}

void unlock()
{
    GeneralMemoryBarrier();
    spinlock = 0;
}

void CPU1()
{
    lock();
    x += 1;
    unlock();
}

void CPU2()
{
    lock();
    x -= 1;
    unlock();
}

void CPU3()
{
    printf("%d", x);
}

Can the store to x be ordered to happen before the spinlock is acquired? My understanding is that no, it cannot be, so I do not see any case where we would need more than LoadMemoryBarrier() inside the lock() function. Is my understanding correct, and can you provide an example where this barrier is not enough of a guarantee?

EDIT:

I am using C99 and do not have access to stdatomic.h. Assume CAS does not include any implicit barriers.

user1806687
    Is this actually an assembly question in disguise as C? If not, how are those C functions defined? Is `CAS` `atomic_compare_exchange_weak` from `stdatomic.h`? (in which case its default memory_order is `memory_order_seq_cst`, so is itself an acquire operation.) Or GNU C `__atomic_compare_exchange_n` with some memory order? https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html - that can do atomic RMWs on plain `int` variables. – Peter Cordes Aug 21 '23 at 19:50
    But I think this looks more like an assembly question since you give the ISA's memory-ordering rules. You say loads can reorder with atomic RMWs, so at least the load side of `x += 1` could happen before the CAS. But critically it can't reorder stores with atomic RMWs, so it looks like a LoadLoad barrier after CAS is sufficient and necessary for acquire semantics since LoadStore is already blocked (of the CAS's load with later stores): https://preshing.com/20120913/acquire-and-release-semantics/ – Peter Cordes Aug 21 '23 at 19:53
  • What are you asking here? Are you asking if another `LoadMemoryBarrier()` would be needed in the `CPU1()` function after the `lock()` call, which already did one? Or are you asking whether a full barrier is needed? I don't think so, because of atomic RMWs having some extra ordering on this ISA that plain load and plain store don't have. – Peter Cordes Aug 21 '23 at 20:00
    Also keep in mind that C `x += 1` runs as separate load, add, and store operations, that's why parts of it can reorder with the CAS. It's not an atomic RMW, otherwise you wouldn't need the lock in the first place. (`__atomic_fetch_add(&x, 1, __ATOMIC_RELAXED)` would be more efficient, too.) – Peter Cordes Aug 21 '23 at 21:02
  • @PeterCordes I am asking if `LoadMemoryBarrier()` is enough of a guarantee in `lock()` or if a full barrier is needed. Also, I am wondering how the critical section (or the code in general) would need to change so that `LoadMemoryBarrier()` would NOT be enough of a guarantee and a full barrier would be needed. – user1806687 Aug 22 '23 at 03:59
  • @PeterCordes Consider `CAS` not to include any implicit barriers; it simply emits the appropriate assembly instruction. – user1806687 Aug 22 '23 at 04:14
  • I earlier misread the sentence about the CPU's memory model. It says it *can* reorder either (loads or stores) against atomic operations (presumably meaning atomic RMWs). So it's like ARM, you need to replace the LoadLoad barrier with `GeneralMemoryBarrier()` to make your lock safe for the general case (to also block LoadStore reordering between the load half of the RMW and any potential pure store). In this specific case, though, the store side of `x += 1` would have a data dependency on the load, and the barrier is enough to give LoadLoad ordering. – Peter Cordes Aug 22 '23 at 04:14
  • What's the point of calling this C99? Your question entirely depends on asm for a specific (hypothetical) CPU, and no compile-time reordering. This might as well be pseudo-code; people that know about C aren't going to give you the answers you're looking for based on their knowledge of C. C99 doesn't have a memory model that says anything about multiple threads, so multithreading with C99 was all just a de-facto / works-in-practice situation dependent on real compilers, and inline asm or intrinsics + `volatile`. – Peter Cordes Aug 22 '23 at 04:19
  • @PeterCordes Can you come up with a case where a LoadLoad barrier would not be enough and a GeneralMemoryBarrier() would be needed? Then you can change this into an answer and I will accept. – user1806687 Aug 22 '23 at 04:24
  • @PeterCordes I've stated this is C99 to make it known that I do not have access to stdatomic – user1806687 Aug 22 '23 at 04:25
  • Yes, like I said, a pure store. With `lock(); x = 3; unlock();` , the store could become visible before this thread acquires the lock, because your LoadMemoryBarrier() only blocks LoadLoad reordering, not LoadStore. Another thread could see `x`'s value change while it still owned the lock. – Peter Cordes Aug 22 '23 at 04:26

2 Answers


You really ought to just use stdatomic.h primitives. On arches that need them (e.g. arm), they insert cache flush instructions (if needed).

But, even with the correct(ed) calls, your lock/unlock code is semi-broken.

You're trying to do the equivalent of a "ticket lock". See: https://en.wikipedia.org/wiki/Ticket_lock

But, with more than two threads, your implementation has a race condition.

With three (or more) threads, when a thread does unlock, the other two threads doing lock will race to get it.

One thread could monopolize the lock and the other threads might never get it.


To guarantee fairness and progress, instead of a single variable spinlock, use two variables (e.g. spin_request and spin_grant); this guarantees that progress is made.

#include <stdatomic.h>

atomic_uint spin_request = 0;
atomic_uint spin_grant = 1;

void
lock(void)
{
    unsigned int request = atomic_fetch_add(&spin_request, 1) + 1;

    // Spin until our ticket number is granted.
    while (atomic_load(&spin_grant) != request)
        ;
}

void
unlock(void)
{
    atomic_fetch_add(&spin_grant, 1);
}

As Peter noted, if all you're doing is incrementing/decrementing x, then doing:

  1. x += 1 --> atomic_fetch_add(&x,1)
  2. x -= 1 --> atomic_fetch_sub(&x,1)

would probably be better.
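As a hedged sketch (this assumes C11 `<stdatomic.h>`, which the OP says they don't have, so it's for comparison only), the lock-free version of the two critical sections looks like this:

```c
#include <stdatomic.h>

/* With atomic RMWs there is no lock at all for a simple counter:
 * each fetch_add/fetch_sub is one indivisible read-modify-write,
 * so concurrent increments and decrements can't be lost. */
static _Atomic int x = 0;

static void cpu1(void) { atomic_fetch_add(&x, 1); }  /* was: lock(); x += 1; unlock(); */
static void cpu2(void) { atomic_fetch_sub(&x, 1); }  /* was: lock(); x -= 1; unlock(); */
```

CPU3's plain read would become `atomic_load(&x)`, which also makes the compiler's obligations explicit.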


Also note that some arches (e.g. x86) have single instructions for doing the above (e.g. `lock add`). But for those that don't (e.g. ARM), the functions are implemented with the correct (and best) LL/SC or CAS-like instructions.

The stdatomic.h primitives will choose the best way(s) to do things (some newer ARM arches have special load/store instructions with acquire/release semantics).

So, once again, unless you want to recreate all that knowledge, it's better to use the standardized compiler primitives.
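For completeness, here is a minimal sketch of what those primitives buy you, again assuming C11 `<stdatomic.h>` is available: `memory_order_acquire` on the lock-taking RMW and `memory_order_release` on the unlocking store are exactly the one-way fences the hand-rolled version has to build out of raw barriers.

```c
#include <stdatomic.h>

static atomic_flag spinlock = ATOMIC_FLAG_INIT;

static void lock(void)
{
    /* test_and_set returns the previous value: spin while it was already set.
     * Acquire ordering keeps critical-section accesses from moving above
     * the successful RMW. */
    while (atomic_flag_test_and_set_explicit(&spinlock, memory_order_acquire))
        ;
}

static void unlock(void)
{
    /* Release ordering keeps critical-section accesses from moving below
     * this store. */
    atomic_flag_clear_explicit(&spinlock, memory_order_release);
}
```

The compiler lowers these to whatever the target ISA needs (barriers on ARMv7, `ldaxr`/`stlr` on ARMv8, nothing extra on x86 stores).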

Craig Estey
  • *On arches that need them (e.g. arm), they insert cache flush instructions (if needed).* - real-world CPU architectures have coherent caches. They need to *order* their accesses to their own cache (with barrier instructions like `dmb ish`), but they don't need to *flush* or invalidate it explicitly to implement `std::atomic`'s ordering rules. Our copy of a cache line will be invalidated by another core before it can commit a write to its L1d. See https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ for a good analogy for that model of memory ordering. – Peter Cordes Aug 22 '23 at 02:16
  • I do not have access to stdatomic nor to any synchronization primitive other than what was explained in the post. I am trying to implement a spinlock on this architecture. Assume that CPUs acquiring the lock do so rarely and will never monopolize it. – user1806687 Aug 22 '23 at 04:02
  • @user1806687 `stdatomic.h` et al. is part of the C standard and is implemented by the compiler, _not_ (e.g.) `libc`, etc. So, what is the arch/ISA/platform and what is the C compiler in question? Not what it conforms to (e.g. C99) but the actual compiler (e.g. `gcc`, `clang`, MS, `icc`). And why do you want to talk about/use multicore primitives but restrict yourself to an obsolete C compiler/standard? I was doing this (and did exactly what you're trying to do--roll your own) 18 years ago (before it was standardized in C). But, again, what is the arch/ISA? – Craig Estey Aug 22 '23 at 07:14
  • I am using Tasking compiler and Tricore architecture. – user1806687 Aug 24 '23 at 10:53

So your C is basically just pseudo-code for asm for this ISA, and we need to assume no compile-time reordering. (Or that these special barrier functions allow as much compile-time reordering as they allow at run-time?)

Your memory model is pretty similar to 32-bit ARM: anything can reorder, and the choice of non-full barriers includes LoadLoad and StoreStore, but no way to block LoadStore reordering with anything weaker than a full barrier (which is expensive because it has to drain the store buffer before any later loads or stores, to block StoreLoad reordering).

In the general case, that means your lock() function needs to upgrade its barrier to GeneralMemoryBarrier() to get acquire semantics; blocking only LoadLoad is not sufficient. You also need to block LoadStore - see https://preshing.com/20120913/acquire-and-release-semantics/. If a thread did lock(); x = 3; unlock();, the store could become visible before this thread acquires the lock. Another thread could see x's value change while it still owned the lock. This is easy to see if we consider compile-time reordering: the store moves up past the LoadLoad barrier, and up past the CAS loop.
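In the question's own pseudo-C (CAS and the barrier functions being the hypothetical ISA's primitives as defined there), the general-case fix is just to widen the barrier in lock():

```
void lock()
{
    while (CAS(&spinlock, 1, 0) == 1); // Wait to acquire lock
    // Spinlock acquired
    GeneralMemoryBarrier(); // blocks LoadStore as well as LoadLoad, so even a
                            // pure store like x = 3 can't move above the CAS
}
```

unlock() already has the matching release-side guarantee: its GeneralMemoryBarrier() keeps every critical-section access ahead of the `spinlock = 0` store.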

It might not be plausible for a real CPU to actually reorder a store ahead of an atomic RMW retry loop; it can't commit stores to memory (or coherent cache if you had any) unless they're known to be non-speculative. But just considering the memory-ordering rules that a CAS can reorder in either direction with a store, this is allowed.

Your unlock() would be fine if this were asm. But since you keep claiming it's real C99, an assignment to a non-volatile variable can be problematic. It's probably fine in this case, but see Who's afraid of a big bad optimizing compiler? for many of the well-known and more subtle things that can go wrong when relying on just barriers, without volatile, to force the compiler to emit instructions that access memory.


In your specific case, with x += 1 in the critical section

In this case the store side of the x += 1 is dependent on a load. Even with something like value-prediction to make a speculative result ready before the load is allowed to happen, a CPU can't let mis-speculated values become visible to other cores (via memory), or they'd also have to get rolled back on mis-speculation.

Real-world CPUs basically have no choice but to do dependency ordering, that a store whose value depends on an earlier load can't become visible before that load produces a value. (DEC Alpha famously violated causality with a load whose address depended on an earlier load, via some weird cache partitioning effect on a few models.)

If we can rely on the data dependency to ensure that the store half of x += 1 isn't visible until after the load half, a LoadLoad barrier after CAS is sufficient to contain this critical section.

CAS spinlock       // atomic RMW: a load and a store
   LoadLoad barrier
load x             // can't reorder earlier than the barrier
store x            // can't happen earlier than "load x" (data dependency)

   Full barrier
store spinlock     // release the lock

The store half of the CAS (atomic RMW) can reorder later, but it's still an atomic RMW. The only necessary thing is to keep the critical section ops after the load side of the atomic RMW that takes the lock. (In C11, taking a mutex has only acquire semantics.)

Peter Cordes