
The following example shows four versions of atomically incrementing (or applying some other RMW operation to) a variable a1 or a2 (depending on the version). The variable a1 or a2 may be shared with some form of ISR.

The question concerns a Cortex-M4 (STM32G431). The compiler is g++ (see below).

Version 1: As I understand it, entering an ISR automatically issues a clrex, so that the first strex always fails if the sequence is interrupted. Correct? And therefore the ISR itself does not have to use ldrex/strex? The implicit clrex acts as a sort of global memory clobber: would it be possible to limit the clobber to a2 in the ISR?

Version 2: Do __disable_irq()/__enable_irq() contain a compile-time barrier? So, are the explicit barriers unnecessary? Would it be better (for performance) to disable only the IRQ that could modify the variable a2?

Comparing Versions 1 and 2: if no IRQ hits the sequence, both should use about the same number of CPU cycles, but if an IRQ does hit, Version 1 uses more cycles?

Version 3: This produces additional dmb barrier instructions. But as I understand it, these dmb are not necessary on a single-core M4?

Version 4: This does not generate the dmb instructions of Version 3. Should this be the preferred way on a single core?

#include <stm32g4xx.h>
#include <atomic>

namespace  {
    std::atomic_uint32_t a1;
    uint32_t a2;
}

int main(){
    while(true) {
        // Version 1: manual ldrex/strex retry loop
        uint32_t val;                                            
        do {                                                     
            val = __LDREXW(&a2); 
            val += 1;
        } while ((__STREXW(val, &a2)) != 0U); 
        
        // Version 2: plain increment with interrupts disabled
        __disable_irq();
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier really necessary?
        ++a2;          
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier really necessary?
        __enable_irq();
        
        // Version 3: std::atomic RMW with default (seq_cst) ordering
        std::atomic_fetch_add(&a1, 1);
        
        // Version 4: relaxed atomic RMW, ordering enforced only at compile time
        std::atomic_signal_fence(std::memory_order_seq_cst);    // compile-time barrier
        std::atomic_fetch_add_explicit(&a1, 1, std::memory_order_relaxed);
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }
}

Compile the above with arm-none-eabi-g++ -I../../../STM32CubeG4/Drivers/CMSIS/Core/Include -I../../../STM32CubeG4/Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../../../STM32CubeG4/Drivers/STM32G4xx_HAL_Driver/Inc -DSTM32G431xx -O3 -std=c++23 -fno-exceptions -fno-unwind-tables -fno-rtti -fno-threadsafe-statics -funsigned-char -funsigned-bitfields -fshort-enums -ffunction-sections -fdata-sections -fconcepts -ftemplate-depth=2048 -fstrict-aliasing -Wstrict-aliasing=1 -Wall -Wextra -I. -mthumb -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -fverbose-asm -Wa,-adhln -S -o test99.s test.cc

wimalopaan
  • Just use `a.fetch_add(1);`. Memory barriers are nops on single-core architectures, so they cost almost nothing. Put your actual barrier needs in the argument if you really want to micro-optimize. Technically, disabling ISRs could be slightly better, but you would need so much contention for it to be worth it that your software would have other problems first. – ElderBug Aug 08 '23 at 14:00
  • Where do you get the fact that `dmb` on the M4 is zero cycles? In https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions it says `dmb` costs a minimum of 1 cycle. – wimalopaan Aug 08 '23 at 14:06
  • Never said it costs zero cycles; I'm saying it's a nop, which also costs 1 cycle. Since the architecture is relatively simple, I don't think you can have 0-cycle instructions. – ElderBug Aug 08 '23 at 14:11
  • @ElderBug: You could potentially post-process GCC's asm output files (`.s`) with `sed` or `grep -v` or something to remove `dmb ish` instructions entirely. Or maybe there's a GAS option to fully omit them. Using `memory_order_relaxed` in the source would avoid them, but that would allow compile-time reordering which you *don't* want. – Peter Cordes Aug 08 '23 at 15:50
  • Re: entering an ISR automatically doing a `clrex`: on at least *some* ARM cores, an ISR has to manually do a `clrex`. But yes, apparently on Cortex-M it's automatic: [When is CLREX actually needed on ARM Cortex M7?](https://stackoverflow.com/a/54294473) – Peter Cordes Aug 08 '23 at 15:52
  • @PeterCordes: do you mean then that version 4 is wrong (it avoids the `dmb ish`)? – wimalopaan Aug 08 '23 at 15:57
  • Oh, I hadn't read the full question since I was in a hurry. Yeah, good idea using `atomic_signal_fence` to maintain compile-time ordering without `dmb`, while using `relaxed` for the RMWs. That does indeed solve all the problems for single-core, but makes the source look clunky and not be safe to compile for other targets. :/ (Promoting them all to `atomic_thread_fence` would make correct but slower code for other targets, like x86 or AArch64, including around pure loads.) – Peter Cordes Aug 08 '23 at 16:07
  • Btw, I'm mentioning this just in case, but in many cases, if you have only one writing "thread" and the others only read, you don't need an atomic add, just atomic reads and writes, which will be faster than anything here. – ElderBug Aug 08 '23 at 17:09
  • @ElderBug: Sure, one common case would be a SysTick ISR incrementing a uint32_t. In this case the non-ISR code could simply read the uint32_t, while the ISR could simply increment it (if the SysTick IRQ has the highest priority). But if the counter were uint64_t, the reading non-ISR code would have to use `disable_irq()`. The `ldrex` wouldn't help in this case, I think. – wimalopaan Aug 08 '23 at 18:21
  • For a 64-bit counter with one (preferably infrequent) writer and multiple readers, the SeqLock pattern is usually very good: [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003). Especially if you actually have multiple cores, since it keeps the readers purely read-only, not needing exclusive ownership to verify read atomicity. It's also viable on Cortex-M, and avoids any need to disable interrupts on MCUs that can't do a 64-bit load or store with a single instruction (which would make atomic wrt. interrupts on most ISAs, including ARM.) – Peter Cordes Aug 19 '23 at 04:22
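
To make the SeqLock idea from the last comment concrete, here is a minimal, hypothetical sketch for a 64-bit counter with a single writer (e.g. a SysTick ISR) and lock-free readers on a 32-bit Cortex-M. The type and member names are illustrative, and the relaxed-plus-signal-fence ordering mirrors version 4 of the question, i.e. it only covers the single-core/ISR case; a multi-core port would need acquire/release ordering or atomic_thread_fence, as the linked answer discusses.

#include <atomic>
#include <cstdint>

// Hypothetical SeqLock-style 64-bit counter: one writer (an ISR), any number
// of readers, no interrupt disabling. Even sequence = stable, odd = write in progress.
struct SeqCounter64 {
    std::atomic<uint32_t> seq{0};
    uint64_t value = 0;            // not atomic on its own; guarded by seq

    // Writer side; assumed to run in a single context (e.g. the SysTick ISR).
    void increment() {
        // Single writer, so a plain load+store is enough (no ldrex/strex needed).
        seq.store(seq.load(std::memory_order_relaxed) + 1u,
                  std::memory_order_relaxed);                 // now odd
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compile-time ordering only
        ++value;
        std::atomic_signal_fence(std::memory_order_seq_cst);
        seq.store(seq.load(std::memory_order_relaxed) + 1u,
                  std::memory_order_relaxed);                 // even again
    }

    // Reader side: retry until a consistent snapshot was observed.
    // On a single core the ISR can interrupt the reader but not vice versa,
    // so a retry only happens if the ISR actually fired mid-read.
    uint64_t read() const {
        for (;;) {
            uint32_t s0 = seq.load(std::memory_order_relaxed);
            std::atomic_signal_fence(std::memory_order_seq_cst);
            uint64_t v = value;
            std::atomic_signal_fence(std::memory_order_seq_cst);
            uint32_t s1 = seq.load(std::memory_order_relaxed);
            if (s0 == s1 && (s0 & 1u) == 0u) {
                return v;
            }
        }
    }
};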

1 Answer


Let's go through history...

Version 2: Do __disable_irq()/__enable_irq() contain a compile-time barrier? So, are the explicit barriers unnecessary? Would it be better (for performance) to disable only the IRQ that could modify the variable a2?

On a single core, disabling interrupts works. However, it increases interrupt latency (globally). This can be hard to test, because you need an interrupt to hit exactly between those lines. Missing an interrupt in a pacemaker is not good, and you have created a latency that is hard to reproduce. Moreover, ALL interrupts pay the price, even those that never touch the variable. For this one case it is fine, but maybe you end up with 80-90 areas like this in the code, so it does not scale either. This was typically the only way to do things before the CPU designs of the early 2000s.
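
If you do choose the interrupt-disabling route, a common refinement is to save and restore PRIMASK instead of blindly re-enabling interrupts, so the critical section nests and never enables IRQs in a context where they were already off. A minimal sketch, assuming the CMSIS intrinsics __get_PRIMASK()/__disable_irq()/__set_PRIMASK(); the guard class name is illustrative.

#include <stm32g4xx.h>
#include <atomic>
#include <cstdint>

// Hypothetical RAII guard around "version 2": restores the previous PRIMASK
// value instead of unconditionally calling __enable_irq().
class IrqGuard {
public:
    IrqGuard() : saved_{__get_PRIMASK()} {
        __disable_irq();
        // In recent cmsis_gcc.h the disable/enable intrinsics are inline asm
        // with a "memory" clobber, i.e. a compile-time barrier; the signal
        // fences are belt and braces if you do not want to rely on that.
        std::atomic_signal_fence(std::memory_order_seq_cst);
    }
    ~IrqGuard() {
        std::atomic_signal_fence(std::memory_order_seq_cst);
        __set_PRIMASK(saved_);
    }
private:
    uint32_t saved_;
};

// Usage, replacing version 2 of the question:
//   { IrqGuard guard; ++a2; }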

Versions 1, 3, and 4 are all variations on the same theme. Here only the code involved with the atomics incurs some delay. This is more scalable: if the code is never active, no one pays the price, and uninvolved interrupts are not delayed.
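
That "same theme" of version 1 can be factored into a small helper so every RMW on a shared 32-bit variable reuses the same retry loop. A sketch, assuming the CMSIS __LDREXW/__STREXW intrinsics already used in the question; the helper name and the lambda are illustrative.

#include <stm32g4xx.h>
#include <cstdint>

// Hypothetical generic read-modify-write helper built on ldrex/strex.
// The strex fails (returns non-zero) if the exclusive monitor was cleared,
// e.g. by the implicit clrex on exception entry, and the loop retries.
template <typename F>
uint32_t atomic_rmw(volatile uint32_t& var, F modify) {
    uint32_t newval;
    do {
        newval = modify(__LDREXW(&var));
    } while (__STREXW(newval, &var) != 0U);
    return newval;
}

// Usage, equivalent to version 1:
//   atomic_rmw(a2, [](uint32_t v) { return v + 1u; });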

Part of the issue is a1 versus a2. It may take multiple cycles to change a memory element; for instance, with a 16-bit bus you need to write the low part and then the high part. The C/C++ atomic API is geared to handle this case, but 32-bit aligned types are naturally atomic (with respect to accesses on the same core) on these ARM CPUs. This might not translate if you have a 16-bit CPU.
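
One way to make that assumption explicit in the source is to assert lock-freedom at compile time, so a port to a target where the counter would silently fall back to a locking implementation fails to build. A short sketch; the 64-bit line is only there to illustrate a case that is not lock-free on ARMv7-M.

#include <atomic>
#include <cstdint>

// On a Cortex-M4 (ARMv7-M) a naturally aligned 32-bit atomic is lock-free;
// a 64-bit one is not, since there is no 64-bit exclusive load/store.
static_assert(std::atomic<uint32_t>::is_always_lock_free,
              "shared 32-bit counters must be lock-free on this target");
// static_assert(std::atomic<uint64_t>::is_always_lock_free);  // would fail on ARMv7-M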

You might be correct that version 3 performs an unneeded dmb, but the code is concise, it could be improved in a future version of gcc/g++, and there are Cortex-M4 parts with MPUs and/or cache structures where this might become relevant. Beyond the single core itself, you can also think about peripherals with DMA engines. If you have made the choice to use C++, std::atomic_fetch_add() is a gain.

The simple analysis you have done does not take more complex code and advanced optimization into account. The compiler can inline, duplicate, and perform code motion; all of these can create new situations that a simple test case of the generated assembler will not expose. The micro-optimizations of versions 1 and 4 may pay off in this limited test. However, in real code there are multiple accesses to manage, and the compiler (modern register allocation, SSA, and data-flow analysis) is very good at optimizing them; looking at a limited access pattern does not take those other optimizations into account. Given this, I would stick to std::atomic_fetch_add(), as the tool makers have a better understanding of these things. And remember the general caveat against 'premature optimization'.

The other caveat is that your code might be ported to some other ARM CPU or a completely different architecture. Do these micro-optimizations hold in those cases? Sorry, I don't know. And no one does, because you cannot fathom some future CPU design.

artless noise