
Related to this thread, I have a FIFO which should work across different interrupts on a Cortex M4.

The head index must be

  • atomically written (modified) by multiple interrupts (not threads)
  • atomically read by a single (lowest level) interrupt

The function for moving the FIFO head looks something like this (the actual code also checks whether the head overflowed, but this is the main idea):

#include <stdatomic.h>
#include <stdint.h>

#define FIFO_LEN 1024
extern _Atomic int32_t _head;

int32_t acquire_head(void)
{
    while (1)
    {
        int32_t old_h = atomic_load(&_head);
        int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);

        if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
        {
            return old_h;
        }
    }
}

GCC will compile this to:

acquire_head:
        ldr     r2, .L8
.L2:
        // int32_t old_h = atomic_load(&_head);
        dmb     ish
        ldr     r1, [r2]
        dmb     ish

        // int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
        adds    r3, r1, #1
        ubfx    r3, r3, #0, #10

        // if (atomic_compare_exchange_strong(&_head, &old_h, new_h))
        dmb     ish
.L5:
        ldrex   r0, [r2]
        cmp     r0, r1
        bne     .L6
        strex   ip, r3, [r2]
        cmp     ip, #0
        bne     .L5
.L6:
        dmb     ish
        bne     .L2
        bx      lr
.L8:
        .word   _head

This is a bare metal project without an OS/threads. This code is for a logging FIFO which is not time critical, but I don't want acquiring the head to impact the latency of the rest of my program, so my questions are:

  • do I need all these dmbs?
  • will there be a noticeable performance penalty with these instructions, or can I just ignore this?
  • if an interrupt happens during a dmb, how many additional cycles of latency does it create?
Lou
  • Not good enough, look here: https://stackoverflow.com/questions/40019929/temporarily-disable-interrupts-on-arm – Hans Passant Jan 06 '19 at 13:53
  • @HansPassant: why not good enough? `ldrex`/`strex` are designed specifically to handle atomic CAS operations. Look here: https://stackoverflow.com/a/51797599/1488067. – Lou Jan 06 '19 at 14:08

3 Answers


TL:DR yes, LL/SC (STREX/LDREX) can be good for interrupt latency compared to disabling interrupts, by making an atomic RMW interruptible with a retry.

This may come at the cost of throughput, because apparently disabling / re-enabling interrupts on ARMv7 is very cheap (like maybe 1 or 2 cycles each for cpsid if / cpsie if), especially if you can unconditionally enable interrupts instead of saving the old state. (Temporarily disable interrupts on ARM).

The extra throughput costs are: if LDREX/STREX are any slower than LDR / STR on Cortex-M4, a cmp/bne (not-taken in the successful case), and any time the loop has to retry the whole loop body runs again. (Retry should be very rare; only if an interrupt actually comes in while in the middle of an LL/SC in another interrupt handler.)


C11 compilers like gcc don't have a special-case mode for uniprocessor systems or single-threaded code, unfortunately. So they don't know how to do code-gen that takes advantage of the fact that anything running on the same core will see all our operations in program order up to a certain point, even without any barriers.

(The cardinal rule of out-of-order execution and memory reordering is that it preserves the illusion of a single-thread or single core running instructions in program order.)

The back-to-back dmb instructions separated only by a couple ALU instructions are redundant even on a multi-core system for multi-threaded code. This is a gcc missed-optimization, because current compilers do basically no optimization on atomics. (Better to be safe and slowish than to risk ever being too weak. It's hard enough to reason about, test, and debug lockless code without worrying about possible compiler bugs.)


Atomics on a single-core CPU

You can vastly simplify it in this case by masking after an atomic_fetch_add, instead of simulating an atomic add with earlier rollover using CAS. (Then readers must mask as well, but that's very cheap.)

And you can use memory_order_relaxed. If you want reordering guarantees against an interrupt handler, use atomic_signal_fence to enforce compile-time ordering without asm barriers against runtime reordering. User-space POSIX signals are asynchronous within the same thread in exactly the same way that interrupts are asynchronous within the same core.

// readers must also mask _head & (FIFO_LEN - 1) before use

// Uniprocessor but with an atomic RMW:
int32_t acquire_head_atomicRMW_UP(void)
{
    atomic_signal_fence(memory_order_seq_cst);    // zero asm instructions, just compile-time
    int32_t old_h = atomic_fetch_add_explicit(&_head, 1, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);

    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    return new_h;
}

On the Godbolt compiler explorer

@@ gcc8.2 -O3 with your same options.
acquire_head_atomicRMW_UP:
    ldr     r3, .L4           @@ load the static address from a nearby literal pool
.L2:
    ldrex   r0, [r3]
    adds    r2, r0, #1
    strex   r1, r2, [r3]
    cmp     r1, #0
    bne     .L2               @@ LL/SC retry loop, not load + inc + CAS-with-LL/SC
    adds    r0, r0, #1        @@ add again: missed optimization to not reuse r2
    ubfx    r0, r0, #0, #10
    bx      lr
.L4:
    .word   _head

Unfortunately there's no way I know of in C11 or C++11 to express an LL/SC atomic RMW that contains an arbitrary set of operations, like add and mask, so we could get the ubfx inside the loop, as part of what gets stored to _head. There are compiler-specific intrinsics for LDREX/STREX, though: Critical sections in ARM.
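For example, with the CMSIS exclusive-access intrinsics it could look something like this (an untested sketch, assuming a CMSIS-Core device header that provides __LDREXW / __STREXW; acquire_head_ldrex is just an illustrative name):

#include <stdint.h>
// Sketch: assumes a CMSIS-Core device header providing __LDREXW / __STREXW.

#define FIFO_LEN 1024
extern volatile uint32_t _head;

static inline uint32_t acquire_head_ldrex(void)
{
    uint32_t old_h, new_h;
    do {
        old_h = __LDREXW(&_head);               // ldrex: open the exclusive monitor
        new_h = (old_h + 1u) & (FIFO_LEN - 1u); // add and mask inside the transaction
    } while (__STREXW(new_h, &_head) != 0u);    // strex: returns 0 on success, else retry
    return old_h;
}

With the mask inside the LL/SC transaction, _head itself always stays in range, so readers wouldn't need to mask.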

The atomic_fetch_add version is safe because _Atomic integer types are guaranteed to be 2's complement with well-defined overflow = wraparound behaviour. (int32_t is already guaranteed to be 2's complement because it's one of the fixed-width types, but the no-UB-wraparound is only for _Atomic). I'd have used uint32_t, but we get the same asm.
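For completeness, a reader of the fetch_add version just masks on every use, something like this (a sketch; read_head_index is an illustrative name, using the _head declaration from the question):

// _head counts freely and wraps at 2^32, so consumers mask before indexing.
static inline int32_t read_head_index(void)
{
    return atomic_load_explicit(&_head, memory_order_relaxed) & (FIFO_LEN - 1);
}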


Safely using STREX/LDREX from inside an interrupt handler:

ARM® Synchronization Primitives (from 2009) has some details about the ISA rules that govern LDREX/STREX. Running an LDREX initializes the "exclusive monitor" to detect modification by other cores (or by other non-CPU things in the system? I don't know). Cortex-M4 is a single-core system.

You can have a global monitor for memory shared between multiple CPUs, and local monitors for memory that's marked non-shareable. That documentation says "If a region configured as Shareable is not associated with a global monitor, Store-Exclusive operations to that region always fail, returning 0 in the destination register." So if STREX seems to always fail (so you get stuck in a retry loop) when you test your code, that might be the problem.

An interrupt does not abort a transaction started by an LDREX. If you were context-switching to another context and resuming something that might have stopped right before a STREX, you could have a problem. ARMv6K introduced clrex for this, otherwise older ARM would use a dummy STREX to a dummy location.

See When is CLREX actually needed on ARM Cortex M7?, which makes the same point I'm about to, that CLREX is often not needed in an interrupt situation, when not context-switching between threads.

(Fun fact: a more recent answer on that linked question points out that Cortex M7 (or Cortex M in general?) automatically clears the monitor on interrupt, meaning clrex is never necessary in interrupt handlers. The reasoning below can still apply to older single-core ARM CPUs with a monitor that doesn't track addresses, unlike in multi-core CPUs.)

But for this problem, the thing you're switching to is always the start of an interrupt handler. You're not doing pre-emptive multi-tasking. So you can never switch from the middle of one LL/SC retry loop to the middle of another. As long as STREX fails the first time in the lower-priority interrupt when you return to it, that's fine.

That will be the case here because a higher-priority interrupt will only return after it does a successful STREX (or didn't do any atomic RMWs at all).

So I think you're ok even without using clrex from inline asm, or from an interrupt handler before dispatching to C functions. The manual says a Data Abort exception leaves the monitors architecturally undefined, so make sure you CLREX in that handler at least.
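On Cortex-M the closest equivalent is the fault path, so that means something like this at the top of the handler (a sketch; HardFault_Handler is the standard CMSIS handler name, and __CLREX() is the CMSIS intrinsic for the clrex instruction):

void HardFault_Handler(void)
{
    __CLREX();   // monitor state is architecturally undefined after an abort
    /* ... normal fault handling / logging ... */
    for (;;) {}  // typical bare-metal fault handlers hang or reset here
}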

If an interrupt comes in while you're between an LDREX and STREX, the LL has loaded the old data in a register (and maybe computed a new value), but hasn't stored anything back to memory yet because STREX hadn't run.

The higher-priority code will LDREX, getting the same old_h value, then do a successful STREX of old_h + 1. (Unless it is also interrupted, but this reasoning works recursively). This might possibly fail the first time through the loop, but I don't think so. Even if so, I don't think there can be a correctness problem, based on the ARM doc I linked. The doc mentioned that the local monitor can be as simple as a state-machine that just tracks LDREX and STREX instructions, letting STREX succeed even if the previous instruction was an LDREX for a different address. Assuming Cortex-M4's implementation is simplistic, that's perfect for this.

Running another LDREX for the same address while the CPU is already monitoring from a previous LDREX looks like it should have no effect. Performing an exclusive load to a different address would reset the monitor to open state, but for this it's always going to be the same address (unless you have other atomics in other code?)

Then (after doing some other stuff), the interrupt handler will return, restoring registers and jumping back to the middle of the lower-priority interrupt's LL/SC loop.

Back in the lower-priority interrupt, STREX will fail because the STREX in the higher-priority interrupt reset the monitor state. That's good, we need it to fail because it would have stored the same value as the higher-priority interrupt that took its spot in the FIFO. The cmp / bne detects the failure and runs the whole loop again. This time it succeeds (unless interrupted again), reading the value stored by the higher-priority interrupt and storing & returning that + 1.

So I think we can get away without a CLREX anywhere, because interrupt handlers always run to completion before returning to the middle of something they interrupted. And they always begin at the beginning.


Single-writer version

Or, if nothing else can be modifying that variable, you don't need an atomic RMW at all, just a pure atomic load, then a pure atomic store of the new value. (_Atomic for the benefit of any readers).

Or if no other thread or interrupt touches that variable at all, it doesn't need to be _Atomic.

// If we're the only writer, and other threads can only observe:
// again using uniprocessor memory order: relaxed + signal_fence
int32_t acquire_head_separate_RW_UP(void) {
    atomic_signal_fence(memory_order_seq_cst);
    int32_t old_h = atomic_load_explicit(&_head, memory_order_relaxed);

    int32_t new_h = (old_h + 1) & (FIFO_LEN - 1);
    atomic_store_explicit(&_head, new_h, memory_order_relaxed);
    atomic_signal_fence(memory_order_seq_cst);

    return new_h;
}
acquire_head_separate_RW_UP:
    ldr     r3, .L7
    ldr     r0, [r3]          @@ Plain atomic load
    adds    r0, r0, #1
    ubfx    r0, r0, #0, #10   @@ zero-extend low 10 bits
    str     r0, [r3]          @@ Plain atomic store
    bx      lr

This is the same asm we'd get for non-atomic head.

Peter Cordes
  • This is an incredibly detailed answer, thanks a lot! But I don't understand which old code was buggy? Everything is inside the `while` loop. Also, multiple interrupts can move the `head`, but only the lowest interrupt will read it. – Lou Jan 06 '19 at 13:39
  • Good answer but more PC-wise than uC. This code does not guarantee the atomicity of the RMW operation. You have to disable interrupts first!! (or at least all with a higher priority if they access the same data) – 0___________ Jan 06 '19 at 13:48
  • @Lou: oh yes, you're right. I would have written it as a `do{new_h = (old_h+1)&mask;}while(!CAS())` so the extra level of nesting from the `while(1)` and then an `if()` made it ugly. Fixed my answer to not claim that. – Peter Cordes Jan 06 '19 at 13:51
  • @P__J__: I'm assuming that an interrupt handler will use `CLREX` before sending execution to C code, so the STREX will correctly fail once execution resumes, if it was interrupted and another core ran this function while we were in the middle of the LDREX/STREX transaction. I guess I should mention that in the answer. Why would you have to disable interrupts if you're using LDREX/STREX to detect successful atomic RMW? It's definitely atomic with respect to other interrupt handlers. – Peter Cordes Jan 06 '19 at 13:57
  • But it does not change anything. Imagine you are in one interrupt handler and you modify the head; before the store, another interrupt with higher priority preempts the execution - reads the **OLD** value, increments it and stores. Then the first interrupt stores the same value. So the head will be old head+1 when it should be old head+2. One piece of data is lost in the FIFO. – 0___________ Jan 06 '19 at 13:58
  • @P__J__: In that case, the STREX in the lower-priority interrupt handler will fail, because it will run when not in the state created by a matching LDREX. The STREX in the higher-priority interrupt handler makes that happen. See http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf#page=11. Or is there a corner case here that breaks the normal LL/SC operation? I've only really skimmed the docs for this CLREX / open monitor state stuff. – Peter Cordes Jan 06 '19 at 14:01
  • @PeterCordes we do not want STREX to fail in this case. How do you imagine handling the failed store in the interrupt handler? Those instructions are good when we consider an RTOS and for example a mutex implementation - we can suspend the task and when the location becomes available we can put it back in the running state. That is not applicable here in the interrupt handler. – 0___________ Jan 06 '19 at 14:06
  • @P__J__: I imagine it being handled with the retry loop that the compiler generates to implement multithreaded `fetch_add`. Did you even read the asm? I even commented the LL/SC retry loop in the compiler output. It's very much like the CAS retry loop the OP wrote in their C source. If contention is low, it's likely that a compare-and-branch that almost always falls through the first time is cheaper than disabling and re-enabling interrupts. On x86 that takes ~10 cycles on in-order Pentium (or more on OoO exec CPUs); I'd guess that an in-order Cortex-M4 might be similar. – Peter Cordes Jan 06 '19 at 14:11
  • But it will never reach the lower priority interrupt's STREX - your program will die in an infinite loop. Please do not mix x86 examples with ARM M4 ones as they are not comparable. – 0___________ Jan 06 '19 at 14:13
  • @P__J__: the idea is that `STREX` is supposed to fail if the code was interrupted; that's what the explicit `while` loop in my code is there for, and that's why Peter's code has a retry loop created by `atomic_fetch_add_explicit`. How can it die in an infinite loop when it just jumps back to `ldrex r0, [r3]` again? – Lou Jan 06 '19 at 14:15
  • CPSIE/CPSID instructions take one or two clocks each. There is no faster way – 0___________ Jan 06 '19 at 14:16
  • @PeterCordes the lower priority interrupt will not get control anymore (as the higher priority interrupt will loop waiting for the successful store). As I said - those mechanisms are not applicable here. – 0___________ Jan 06 '19 at 14:18
  • @P__J__: I only mentioned x86 for a cycle cost estimate of toggling interrupts, not for anything about correctness. Anyway, assuming CLREX runs in an interrupt handler before the LDREX/STREX loop, then it's in a good state and can complete (regardless of what it interrupted), and will eventually return to a lower-priority interrupt's LDREX/STREX loop. The lower-priority loop will retry after having its STREX fail, and either complete or be interrupted again. If you have any further objections, please explain in more detail and/or read *my* answer/comments more carefully. – Peter Cordes Jan 06 '19 at 14:19
  • How will it return control or get control again? You are thinking PC multitasking, not the interrupt way. A USART interrupt is triggered - you have new data in the DR register. You need to place it in the FIFO. The store fails. You return from the interrupt - the lower priority one completes its store. Then more data is received by the USART - DR contains the new value - the old one is lost. – 0___________ Jan 06 '19 at 14:23
  • @P__J__: You have `LDREX`, followed by an `ADD` instruction, followed by a `STREX`. You **don't** return from the interrupt when `STREX` fails, you retry the operation as long as you need. – Lou Jan 06 '19 at 14:25
  • It will work only if the interrupt is invoked all the time (for example the timer one) or when the task is given execution time on a regular basis (as in the PC OSes). Even RTOSes do not work this way. Only the highest priority task is given the time - and there is no switching (except the situation when you have more than one task with the same priority) – 0___________ Jan 06 '19 at 14:26
  • @Lou - and you die in an infinite loop, as the lower priority interrupt will never be given any execution time - so its store will never happen. – 0___________ Jan 06 '19 at 14:27
  • @P__J__: *The store fails --> You return from the interrupt* - I think this is where the confusion lies; you do not return from the interrupt until the store succeeds. – Lou Jan 06 '19 at 14:27
  • @Lou you will never return from it in this case – 0___________ Jan 06 '19 at 14:28
  • @P__J__: but quite the opposite, nothing can interrupt the higher level interrupt, and it will therefore succeed the first time. – Lou Jan 06 '19 at 14:28
  • @PeterCordes Is there any point in using two `atomic_signal_fence`? Is the first one necessary? – Margaret Bloom Jan 06 '19 at 14:30
  • @Lou **we consider the situation when the lower priority interrupt was preempted before the STREX instruction** – 0___________ Jan 06 '19 at 14:30
  • @P__J__: of course we do, this is the only meaningful situation to consider. Higher interrupt then makes a `LDREX`/`ADD`/`STREX`, succeeds, and returns to the lower interrupt, which then fails and needs to retry. **Any interrupt which does not get interrupted between `LDREX` and `STREX` will succeed.** – Lou Jan 06 '19 at 14:31
  • @PeterCordes how can the lower priority interrupt's STREX be executed when the higher interrupt routine is in a loop? This is the main question - answer it if you know such a method – 0___________ Jan 06 '19 at 14:31
  • @P__J__: why would STREX fail if it's not interrupted again? Like I've explained multiple times, a CLREX before running this C function will definitely ensure that nothing is stopping a LDREX/STREX from succeeding if it's not interrupted, or from being able to retry and succeed if it is interrupted. You still haven't explained how this can get stuck in a loop. You keep claiming that, but you haven't linked to any architecture documentation about why a LDREX/STREX loop could fail repeatedly if not interrupted. – Peter Cordes Jan 06 '19 at 14:31
  • @MargaretBloom: The first one is probably not necessary. There's probably no need for acquire semantics to stop the load from moving earlier, before anything in the caller. It doesn't cost any asm instructions so it seemed reasonable to make extra sure that it couldn't introduce any problems. But yeah it probably only stops the load from being moved earlier to better hide the latency, in that single-writer / multiple-reader case. The multiple-writer case of course needs an atomic RMW, so the compiler can't usefully move it before other memory operations, so there's nothing to lose there. – Peter Cordes Jan 06 '19 at 14:37
  • @PeterCordes: I actually believe `CLREX` is not needed with single-core, interrupts-only code (i.e. if we don't have an RTOS or multiple cores), because `STREX` clears the local record too. So the higher interrupt will perform a `LDREX` to get the fresh value, `STREX` will succeed **and clear the exclusive access record**, and then the lower interrupt `STREX` will fail because the exclusive record has been cleared. With multiple threads, it's possible for a different thread to call `LDREX` before having a chance to run `STREX` as [explained here](https://stackoverflow.com/a/51169716/1488067). – Lou Jan 06 '19 at 15:39
  • @Lou: Yeah, after reading the docs more carefully, I realized the same thing and just finished an update to this answer explaining my reasoning. :P (and added a section at the top about interruptible LL/SC atomics trading latency for possible throughput, if disabling/enabling interrupts is faster than using LL/SC.) Anyway, the doc I found doesn't say that an LDREX from the *same* address re-opens the local monitor, so it might fail the first time through the loop in the higher prio interrupt. But if Cortex-M4's LL/SC state machine is as simple as possible, SC succeeds if the prev insn was LL. – Peter Cordes Jan 06 '19 at 15:47
  • @PeterCordes I'm not familiar with `atomic_signal_fence`, but isn't it a compiler-only barrier? Doesn't it just prevent the compiler from reordering the instructions? There's nothing before the `atomic_fetch_add_explicit` :) – Margaret Bloom Jan 06 '19 at 17:23
  • @MargaretBloom: Yes, it's a compiler-only barrier. This function isn't the interrupt entry point. It's tiny and could easily inline into its caller. *That's* what's before it. Unless you've explicitly used `__attribute__((noinline))` or something, never assume that your C function = a stand-alone asm function. – Peter Cordes Jan 06 '19 at 17:26

Your code is written in a way that is not very "bare metal". Those "general" atomic functions do not know whether the value read or stored is located in internal memory, or is maybe a hardware register located somewhere far from the core and connected via buses and sometimes write/read buffers.

That is the reason why the generic atomic functions have to place so many DMB instructions. Because you read or write an internal memory location, they are not needed at all (the M4 does not have any internal cache, so this kind of strong precaution is not needed either).

IMO it is enough to just disable interrupts when you want to access the memory location atomically.

PS: stdatomic is very rarely used in bare metal uC development.

The fastest way to guarantee exclusive access on an M4 uC is to disable and re-enable interrupts:

__disable_irq();
x++;
__enable_irq();

  71        __ASM volatile ("cpsid i" : : : "memory");
080053e8:   cpsid   i
 79         x++;
080053ea:   ldr     r2, [pc, #160]  ; (0x800548c <main+168>)
080053ec:   ldrb    r3, [r2, #0]
080053ee:   adds    r3, #1
080053f0:   strb    r3, [r2, #0]
  60        __ASM volatile ("cpsie i" : : : "memory");

which will cost only 2 or 4 additional clocks for both instructions.

It guarantees atomicity and does not provide unnecessary overhead.
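If this might be called from a context where interrupts could already be disabled, a nesting-safe variant saves and restores PRIMASK instead of unconditionally re-enabling (a sketch using the standard CMSIS intrinsics __get_PRIMASK / __set_PRIMASK / __disable_irq; enter_critical / exit_critical are illustrative names):

#include <stdint.h>

static inline uint32_t enter_critical(void)
{
    uint32_t primask = __get_PRIMASK(); // remember whether IRQs were already masked
    __disable_irq();                    // cpsid i
    return primask;
}

static inline void exit_critical(uint32_t primask)
{
    __set_PRIMASK(primask);             // restore the old mask, not a blind cpsie i
}

// usage:
//     uint32_t m = enter_critical();
//     x++;
//     exit_critical(m);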

0___________
  • Your answer is useful, so I upvoted it, but the last "does not provide unnecessary overhead" is debatable if you want to ensure the lowest possible latency for higher priority interrupts. In terms of latency, this approach will effectively cost you 2 to 4 additional clocks for `cpsid`/`cpsie` + all clocks needed to execute the instructions between `cpsid` and `cpsie`. – Lou Jan 06 '19 at 15:33
  • yes - so the atomic sections should be as small as possible. – 0___________ Jan 06 '19 at 16:03

dmb is required in situations like

p1:
    str r5, [r1]
    str r0, [r2]

and

p2:
    wait([r2] == 0)
    ldr r5, [r1]

(from http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf, section 6.2.1 "Weakly-Ordered Message Passing problem").

In-CPU optimizations can reorder the instructions on p1, so you have to insert a dmb between both stores.
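The same litmus test written with C11 atomics looks roughly like this (a sketch with illustrative names; on a multi-core ARMv7 target the release store is where a dmb comes from):

#include <stdatomic.h>
#include <stdint.h>

extern int32_t payload;        // the data being passed
extern _Atomic int32_t flag;   // the "message ready" flag

void p1(void)
{
    payload = 42;                                           // str r5, [r1]
    atomic_store_explicit(&flag, 1, memory_order_release);  // dmb ish; str r0, [r2]
}

int32_t p2(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   // spin until the flag is set
    return payload;                                         // ldr r5, [r1]
}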

In your example, there are too many dmbs, which is probably caused by the expansion of the atomic_xxx() functions, which might have a dmb both at the start and at the end.

It should be enough to have

acquire_head:
        ldr     r2, .L8
        dmb     ish
.L2:
        // int32_t old_h = atomic_load(&_head);
        ldr     r1, [r2]
...
        bne     .L5
.L6:
        bne     .L2
        dmb     ish
        bx      lr

and no other dmb between.

Performance impact is difficult to estimate (you would have to benchmark code with and without dmb). dmb does not consume cpu cycles; it just stops pipelining within the cpu.
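One way to run that benchmark on a Cortex-M4 is the DWT cycle counter (a sketch using the standard CMSIS register names; bench_acquire_head is an illustrative wrapper):

#include <stdint.h>
// Sketch: assumes a CMSIS-Core device header for the DWT / CoreDebug definitions.

static inline void cyccnt_start(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable the trace block
    DWT->CYCCNT = 0u;                               // reset the cycle counter
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;           // start counting core cycles
}

uint32_t bench_acquire_head(void)
{
    cyccnt_start();
    uint32_t t0 = DWT->CYCCNT;
    (void)acquire_head();        // code under test, with and without the dmbs
    return DWT->CYCCNT - t0;     // elapsed core cycles (approximate)
}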

ensc
  • That would be true on a multi-core ARM running multi-threaded code, but that's not what this question is about. The OP is on a single-core CPU running single-threaded code. But yes, the `atomic` functions are seq_cst by default, and gcc doesn't optimize away redundant barriers. – Peter Cordes Jan 06 '19 at 13:33
  • @PeterCordes OP is using interrupts; so, it is some kind of "multi-threaded". – ensc Jan 06 '19 at 13:40
  • This is a very outdated document - and it is not applicable here. There are no caches or any other problems except the hardware register access. – 0___________ Jan 06 '19 at 13:43
  • Oh right, then the OP is on a single-core CPU, so everything happens in program order. It's impossible to observe memory-reordering at run-time when everything runs on the same core, because the CPU preserves the illusion of program-order for the core. Like I said in my answer. – Peter Cordes Jan 06 '19 at 13:43
  • @ensc it is not. – 0___________ Jan 06 '19 at 13:44
  • @P__J__ what do you mean with "it is not"? For this problem, there is no big difference between interrupts and threads. Threads on a single core might mitigate the reordering problem by `dmb` in the scheduler, but for interrupts you have to look explicitly in the CPU spec whether `dmb` is implied (like `clrex` on Cortex M4). Memory reordering does not require caches to occur. – ensc Jan 06 '19 at 13:54
  • @ensc: AFAIK, reordering only happens within the pipeline, so you should never experience reordering issues on a single core, which does not have a cache. Once you enter an interrupt, whatever was reordered will have finished executing. – Lou Jan 06 '19 at 18:58