We are using a Zynq-7000 based SoC, i.e. a Cortex-A9, and we encountered the following issue while using atomic flags inside a library we use (OpenAMP).

We are using the second CPU on the SoC to execute bare-metal code.
When disabling the D-cache, atomic ints can no longer be set. Here is a minimal example that triggers the issue for us:

#include <stdint.h>
#include <stdatomic.h>
#include <stdio.h>

#define XREG_CONTROL_DCACHE_BIT (0X00000001U<<2U)
#define XREG_CP15_SYS_CONTROL   "p15, 0, %0,  c1,  c0, 0"
#define mfcp(rn)    ({uint32_t rval = 0U; \
             __asm__ __volatile__(\
               "mrc " rn "\n"\
               : "=r" (rval)\
             );\
             rval;\
             })
#define mtcp(rn, v) __asm__ __volatile__(\
             "mcr " rn "\n"\
             : : "r" (v)\
            );

static void DCacheDisable(void) {
    uint32_t CtrlReg;
    /* clean and invalidate the Data cache */
    CtrlReg = mfcp(XREG_CP15_SYS_CONTROL);

    CtrlReg &= ~(XREG_CONTROL_DCACHE_BIT);
    /* disable the Data cache */
    mtcp(XREG_CP15_SYS_CONTROL, CtrlReg);
}

int main(void) {
    DCacheDisable();

    atomic_int flag = 0;
    printf("Before\n");
    atomic_flag_test_and_set(&flag);
    printf("After\n");
}

The CPU executes the following loop for atomic_flag_test_and_set:

dmb     ish
ldrexb  r1, [r3] ; bne jumps here
strexb  r0, r2, [r3]
cmp     r0, #0
bne     -20     ; addr=0x1f011614: main + 0x00000060
dmb     ish

but the register r0 always stays 1. When we omit the call to DCacheDisable, the code works flawlessly.

I really can't find any information about disabled D-cache and atomic flags.

Does anybody have a clue?

Toolchain: We are using Vitis 2022.2, which comes with arm-xilinx-eabi-gcc.exe (GCC) 11.2.0. Compiler options: -O2 -std=c11 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard

hellow
    That makes sense; apparently `strexb` always fails because the L1d cache didn't maintain exclusive ownership of that cache line since the `ldrexb`. Because caching was disabled. Apparently the microarchitecture doesn't have an alternative bus-lock method of making LL/SC transactions succeed without cache. (Interesting, can ARM in general not do atomic RMWs on uncacheable memory regions, like MMIO?) Perhaps with ARMv8.1 single-instruction RMWs on cores that support it? Or with the legacy `swp` instruction? – Peter Cordes May 09 '23 at 08:14
  • Why are you using `atomic_flag_test_and_set( atomic_flag * )` on an `atomic_int`? (https://en.cppreference.com/w/cpp/atomic/atomic_flag_test_and_set). That's strict-aliasing UB and should give a warning if it compiles at all. Use `atomic_exchange(&flag, 1)` if you want to use `atomic_int` or `atomic_bool`. The asm is fine, though, so that C weirdness isn't the problem. – Peter Cordes May 09 '23 at 08:16
  • @PeterCordes that sounds like a promising solution. Thanks for that detailed input. I "just copied" the code that open-amp was using to generate a MCVE. See https://github.com/OpenAMP/open-amp/blob/accac4d3610cbb268f3c3fe3c31dc45dd4c4dd17/apps/machine/zynq7/zynq_a9_rproc.c#L69 and https://github.com/OpenAMP/open-amp/blob/accac4d3610cbb268f3c3fe3c31dc45dd4c4dd17/apps/machine/zynq7/platform_info.h#L50 – hellow May 09 '23 at 08:36
  • I believe I had similar issue a while ago. But that was some 'Rockchip' AArch64 SoC. And changing atomic variable was hanging cpu when cache was disabled. Surprisingly same code worked fine on Zynq7000 (AArch32) and Zynq-Ultrascale (in AArch64 mode). – user3124812 May 10 '23 at 09:44

1 Answer


This is common on ARM platforms that have a cache. The cache line is used as the backing store for the exclusive monitor. The ARM term is the exclusives reservation granule (ERG), the size of the reserved memory region. On systems with a cache, you will find the granule is typically one cache line.

So internally, ldrex and strex are implemented as part of the cache coherency protocol. Compare this to Cortex-M systems, where the entire memory space is a single reservation granule.

The ldrex/strex pair are useless for synchronizing with external devices that are not part of an AXI structure. If you want to disable cache to work with an FPGA interface, I don't believe this can work. You would need to implement the cache protocol in the FPGA.

For Cortex-M systems, there is no cache structure, and custom logic implements a 'global monitor'.

The cache mechanism actually seems useful, as the cache line could serve as a transactional memory: either the whole line commits or it doesn't. It seems possible to create lock-free algorithms for structures with multiple pointers, where nodes lock only a single entry rather than an entire list. However, I have never seen it used like this, mainly, I think, because the ARM documentation recommends against it (do not rely on the ERG size).

Peter Cordes
artless noise
  • The [C++ 'memory model'](https://stackoverflow.com/questions/6319146/c11-introduced-a-standardized-memory-model-what-does-it-mean-and-how-is-it-g) does seem to generate good code with `g++` (and probably clang as well). Although it may fail for your use case (Cortex-A bare metal), I would recommend it for users of Linux or a Cortex-M system as opposed to custom coding. The compiler will do data-flow analysis, so it is actually much better than an `atomic_flag_test_and_set()` routine. Unfortunately, the Cortex-A is the most complex beast in this regard. – artless noise May 09 '23 at 14:35