ARM Cortex-M4 Mutex Lock. DMB Instruction

Question

I read the following document: Barrier_Litmus_Tests_and_Cookbook by ARM.

Section 7.2 shows the code of acquiring a mutex/semaphore.

Loop
   LDREX R5, [R1] ; read lock
   CMP R5, #0 ; check if 0
   STREXEQ R5, R0, [R1] ; attempt to store new value
   CMPEQ R5, #0 ; test if store succeeded
   BNE Loop ; retry if not
   DMB

The LDREX instruction requests exclusive access to the memory address. The write with STREX only succeeds if the processor still has exclusive access. They use a DMB instruction to ensure that the exclusive write is synchronized to all processors.

I have a small problem with that. Assume the processor has exclusive access to the memory address and locks it. Once the STREX instruction is finished, the exclusive access is removed, and other processors can access this memory from then on. However, the write is still in the processor's cache until the DMB finishes. What happens if another processor tries to acquire the lock when the first processor has already locked it but the write has not been synced to RAM yet? The memory address is no longer exclusively held by the first processor, but the write is not finished.

Can anyone explain why this works and is safe? I have my problems with that.

`DMB` orders memory accesses by making them visible up to various points in the memory hierarchy (i.e. caches). A cache coherence mechanism is necessary to let one processor snoop another one's cache; otherwise, the situation you hypothesised could very well happen. ARM caches are coherent. Further, `DMB` without options does a full-system sync; I don't know exactly what that means, but it could be a write all the way to memory. — Margaret Bloom, Mar 09 '17 at 16:08
The `DMB` is not necessary for the atomicity of the read-modify-write. The `DMB` ensures that *other* memory operations are ordered with regard to the RMW-operation. — EOF, Mar 09 '17 at 19:23
the DMB is to push the load and store out, let them complete so others see it... "ensure the successful claim of the lock is observed by all observers before they observe any subsequent loads or stores." just like you use other memory barriers, flushing write buffers and caches and such. — old_timer, Mar 09 '17 at 23:13
and this does appear to be a cortex-a manual, so don't try to apply it to a cortex-m. — old_timer, Mar 09 '17 at 23:13
they are using a non-exclusive store to "clear the lock"; basically the logic will see that someone without the proper processorid and exclusive access has touched that address. That is one of the two things the ldrex/strex is looking for (a non-exclusive access to that address, or any access from another processor). — old_timer, Mar 09 '17 at 23:16
With a Cortex-M4 CPU you only need to use the DMB instruction when there's a CPU cache and there's some other processor or device that wouldn't see the memory accesses in order without using the instruction. In particular you don't need to use DMB with semaphores and mutexes unless there's another processor that can also access the semaphore or mutex. — Ross Ridge, Mar 09 '17 at 23:53
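
The distinction drawn in the comments above (the exclusive pair provides atomicity, the barrier provides ordering) can be sketched in portable C11. This is a minimal illustration, not ARM's reference code: `spin_mutex` and its functions are hypothetical names, and which instructions a compiler emits for them depends on the target and toolchain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = taken. */
typedef struct {
    atomic_uint state;
} spin_mutex;

/* One attempt to take the lock. Returns true on success. */
static bool spin_mutex_try_lock(spin_mutex *m)
{
    unsigned expected = 0u;

    /* Atomic read-modify-write. Atomicity comes from this operation alone;
     * relaxed ordering is used on purpose to keep the two concerns apart. */
    if (!atomic_compare_exchange_strong_explicit(&m->state, &expected, 1u,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed)) {
        return false; /* somebody else already holds the lock */
    }

    /* Plays the role of the DMB after the loop: accesses inside the critical
     * section may not be reordered before the successful lock acquisition. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}

/* Spin until the lock is taken. */
static void spin_mutex_lock(spin_mutex *m)
{
    while (!spin_mutex_try_lock(m)) {
        /* busy-wait */
    }
}

/* Release the lock. */
static void spin_mutex_unlock(spin_mutex *m)
{
    /* Accesses inside the critical section must complete before the
     * lock is seen as free by other observers. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->state, 0u, memory_order_relaxed);
}
```

A second `spin_mutex_try_lock` on a held lock fails regardless of the fence, because the compare-exchange itself is atomic; the fence only constrains the accesses that follow it.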

old_timer · Accepted Answer · 2017-03-10T00:13:59.543

I think you are overcomplicating it. Look at the AMBA/AXI spec (and also, where did you find a multi-core Cortex-M4?). ldrex/strex are for sharing a resource across processors in a multi-processor chip. They have been incorrectly used for other things for some time now. ARM unfortunately did an unusually bad job of documenting all of this correctly.

The exclusive part of the ldr is that the processorid and the address (range) are saved in a table. When a strex happens, the processorid for that address (range) is checked: if it matches, the response is EXOKAY and the store is performed; if not, the response is OKAY and the store is not performed. Strex does not clear anything; interestingly, there is this clrex instruction, which I assume sets the processorid to some value that won't hit or, depending on how they build their tables, frees up a table entry.

I may try this after writing this, but you can just as easily do ldrex, then strex, then strex. I am fairly certain I have done it on full-sized ARMs; I will try ldrex, strex, strex, clrex, strex on a Cortex-M4 and see what happens.

In a uniprocessor system, ldrex/strex are expected to work in ARM's logic, but the chip vendor is not required to support them and may simply return OKAY (instead of EXOKAY). The L1 certainly, and probably the L2, are ARM logic; beyond that you get into chip vendor logic (do Cortex-Ms have an L2?). Normally you are not going to have to worry about hitting the chip vendor code; you can run a long time, if not indefinitely, without knowing any of this, as you will remain in one of the caches. And disabling both caches in Linux, for example, is a royal PITA; they may make it seem like it is a compile-time option, but dig in and see the reality. And with only one processor, how do you get a different processor ID?

In multi-processor chips, the chip vendor is supposed to support it correctly beyond the caches, if you can even get there with an exclusive access. The way ldrex/strex are normally used, you are most likely to stay within your L1 cache and never get exposed to what the chip vendor has provided, but it can happen if you get interrupted in between, and then you are likely saved by the L2. In this case having more than one processorid in the chip makes sense, as there is more than one processor.

This is nice:

The Cortex-M4 processor implements a local exclusive monitor. The local monitor within the processor has been constructed so that it does not hold any physical address, but instead treats any access as matching the address of the previous LDREX. This means that the implemented exclusives reservation granule is the entire memory address range.

The M7 TRM says the same thing.

Not having multiple cores, how could one generate a different ID? The docs use the term processorid to indicate which processor is being used. How many processors are in a Cortex-M? Perhaps it is documented elsewhere under a different name, but at this time I don't know how the processorid in a Cortex-M is generated, or, it being a uniprocessor, whether there is more than one. I don't have access to a core to know for sure.

So even though the logic does not support per-address exclusive access, they didn't say they don't check the processorid; they simply consider every strex access to memory marked as shared to be checked against the processorid of the last ldrex, independent of its address.

EDIT

PUT32(0x01000600,0x600);
PUT32(0x01000700,0x700);
PUT32(0x01000800,0x800);
CLREX();
hexstring(STREX(0x20000600,0x12345678));
hexstring(STREX(0x20000700,0x12345678));
hexstring(STREX(0x20000800,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000600,0x6666));
hexstring(STREX(0x20000700,0x12345678));
hexstring(STREX(0x20000800,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000700,0x7777));
hexstring(STREX(0x20000800,0x12345678));
hexstring(GET32(0x20000600));
hexstring(GET32(0x20000700));
hexstring(GET32(0x20000800));
CLREX();
hexstring(0xAABBCCDD);
hexstring(LDREX(0x20000600));
CLREX();
hexstring(STREX(0x20000600,0x2222));
hexstring(GET32(0x20000600));

producing

00000001 
00000001 
00000001 
00000600 <-- ldrex
00000000 <-- strex pass
00000001 <-- strex fail
00000001 
00006666 
00000000 
00000001 
00006666 
00007777 
00000800 
AABBCCDD 
00006666 
00000001 
00006666

So it looks like what they did here is that the next strex after an ldrex passes, independent of address. So, using your terms, the strex "clears the lock".

And note that putting a clrex between the ldrex and strex does make the strex fail.
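
The behaviour observed so far can be summarised as a one-bit state machine. The following is a minimal host-side model in C, written only from the outputs shown in this answer and not from ARM's specification: all `model_*` names and the memory size are hypothetical, and the model ignores interrupts and any other event that might affect the real monitor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_WORDS 1024u /* hypothetical size of the simulated memory */

static uint32_t model_mem[MODEL_WORDS];
static bool monitor_open; /* the whole local monitor: one bit, no address */

/* Map a byte address onto the small simulated memory. */
static uint32_t *model_word(uint32_t addr)
{
    return &model_mem[(addr >> 2) % MODEL_WORDS];
}

/* Plain store: in this model it leaves the monitor untouched. */
static void model_put32(uint32_t addr, uint32_t value)
{
    *model_word(addr) = value;
}

/* Plain load. */
static uint32_t model_get32(uint32_t addr)
{
    return *model_word(addr);
}

/* LDREX: open the monitor, whatever the address is. */
static uint32_t model_ldrex(uint32_t addr)
{
    monitor_open = true;
    return *model_word(addr);
}

/* STREX: returns 0 and stores if the monitor is open, else returns 1.
 * Either way the monitor is closed afterwards. */
static uint32_t model_strex(uint32_t addr, uint32_t value)
{
    if (!monitor_open) {
        return 1u;
    }
    monitor_open = false;
    *model_word(addr) = value;
    return 0u;
}

/* CLREX: close the monitor. */
static void model_clrex(void)
{
    monitor_open = false;
}
```

In this model a `model_strex` to a different address than the preceding `model_ldrex` still succeeds, a second `model_strex` fails, and a `model_clrex` in between makes the store fail, which is the pattern in the output above.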

Hitting the same address or not doesn't matter; it is one ldrex to one strex:

hexstring(LDREX(0x20000900));
hexstring(STREX(0x20000900,0x2222));
hexstring(STREX(0x20000900,0x2222));

3EEDCC1B 
00000000 
00000001

Turning the data cache on didn't change the results.

Test functions:

.thumb_func
.globl LDREX
LDREX:
    ldrex r0,[r0]
    bx lr

.thumb_func
.globl CLREX
CLREX:
    clrex
    bx lr

.thumb_func
.globl STREX
STREX:
    strex r0,r1,[r0]
    bx lr

Unlike the big brother ARMs:

CLREX();
hexstring(STREX(0x20000600,0x12345678));
hexstring(LDREX(0x20000600));
hexstring(STREX(0x20000600,0x6666));
hexstring(LDREX(0x20000600));
PUT32(0x20000600,0x11);
hexstring(STREX(0x20000600,0x6666));

00000001 
00000600 
00000000 
00006666 
00000000

The strex survives the non-exclusive access in between. At least based on the document you posted, a non-exclusive store should spoil the prior ldrex (on an ARMv7-A).

Note the above is on a cortex-m4 r0p1 CPUID 0x410FC241

A mutex requires a compiler-ordering barrier. ARM's example of how to use DMB presumes that a compiler's implementation of CMSIS will treat it as such a barrier, but I wouldn't trust certain compilers to treat it like one. — supercat, Oct 08 '19 at 20:31
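
To illustrate the compiler-barrier concern raised in the comment above, here is a minimal sketch of a barrier helper that is also a compiler-ordering barrier. It assumes a GCC- or Clang-compatible compiler; `full_barrier` and `publish` are hypothetical names, not CMSIS functions, and on non-ARM hosts the helper falls back to a C11 fence so the example can still be built and run.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: hardware barrier plus compiler barrier. */
static inline void full_barrier(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    /* The "memory" clobber tells the compiler not to move memory accesses
     * across this statement; the DMB orders them in hardware. */
    __asm__ volatile ("dmb sy" ::: "memory");
#else
    /* Fallback for other targets: a sequentially consistent C11 fence. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

static atomic_uint payload;
static atomic_uint ready;

/* Write the data first, then the flag, with the barrier in between. */
static void publish(unsigned value)
{
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    full_barrier();
    atomic_store_explicit(&ready, 1u, memory_order_relaxed);
}
```

Without the `"memory"` clobber the inline `dmb` would still execute, but the compiler would remain free to reorder surrounding memory accesses around it, which is the failure mode the comment warns about.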

score 0 · Answer 2 · answered Feb 06 '18 at 05:49

It is safe because the chip designer makes it safe. The whole point of Test_and_Set instructions is to be used by the operating system for semaphore and mutex commands. In a multi-core/multi-processor environment, there would be no other way to implement this feature accurately except by a built-in assembly command.
