How is this a guarantee a value has been atomically updated in ARM?

Question

ARM provides LDREX/STREX to atomically load/store values, but I feel like I'm missing something in how this is still an atomic operation. The code below is generally how an increment by one would be done. However, what's preventing something from preempting during the ADD instruction, thereby making it so that r2 no longer matches what is stored in [r0]?

(Assuming r0 is valid and r1 = 1)

ADD
    LDREX r2, [r0]
    ADDS  r2, r2, #1
    STREXNE r2, r1, [r0]        @ Store 1 if the original [r0] was not -1
    CMPNE r2, #1
    BEQ ADD

When the operation is preempted, the code behaves as if it never happened and `r2` is set to 0 to indicate this condition. The code can then try again. — fuz, Jan 07 '21 at 19:17
fuz: Or at least, as if the store never happened. The other instructions in between are not rolled back, AFAIK. — Nate Eldredge, Jan 07 '21 at 19:22
Is that supposed to be `ADDS` instead of `ADD`, so that you only do the store if the value in `[r0]` was previously `-1`? Otherwise I'm not clear what the conditional is supposed to accomplish. And as the code stands, you're unconditionally storing `1` rather than the incremented value from `[r0]`; the operands to `STREXNE` may be reversed, and in that case the `CMPNE` probably wants to have `r1` instead of `r2`. — Nate Eldredge, Jan 07 '21 at 19:40
Sorry, yes, I typed this in a bit of a hurry. It's not the point, however. The point is that in between atomic loads and stores, there is an add. Imagine a bunch of threads are calling this simultaneously, how is it that we are guaranteed that the value stored will actually be updated correctly? For instance, if one thread reads the value is 100, gets to the add instruction, gets interrupted, and another thread reads 100, adds 1, and stores 101. Then the 1st thread would resume, and store 101, instead of 102, which is what it should be. — Maxthecat, Jan 07 '21 at 19:47
@Maxthecat: I see. As old_timer points out, what happens in that case is that the original thread does *not* store at all. The `STREX` instruction detects that someone else wrote that address in the meantime and instead of doing the store it returns failure, causing the loop to repeat and load the new value 101. That is precisely the extra functionality that the `*EX` instructions offer over regular load/store. — Nate Eldredge, Jan 09 '21 at 00:56
@Nate: I think an interrupt handler must also clear the "monitor" (CLREX or a dummy strex, except on some Cortex-M CPUs where this happens automatically on an interrupt), otherwise if you context switch from an LDREX in one thread to an STREX in another thread, a false-positive ("successful" strex) is possible on a CPU that doesn't monitor by address. See [When is CLREX actually needed on ARM Cortex M7?](https://stackoverflow.com/q/51162344). But with a well-behaved OS, yes, [Atomic operations in ARM](https://stackoverflow.com/q/11894059) applies: STREX ensures exclusivity across threads. — Peter Cordes, Jan 09 '21 at 02:29
@PeterCordes Since context switches in preemptive schedulers are driven by a timer interrupt, you can see that CLREX is needed by the OS. This handles the single-CPU case the OP asks about. In the case of multiple CPUs, the cache controller will detect the conflict through the [MOSI protocol](https://en.wikipedia.org/wiki/MOSI_protocol) (or whatever protocol is being used) and make the STREX fail, as the executing CPU will observe that the cache line it is operating on is stale. If the CPU doesn't have a cache, it will implement some minimized logic of MOSI on the entire address space. — artless noise, Jan 09 '21 at 21:41
@artlessnoise: Am I misreading [this answer](https://stackoverflow.com/a/54294473/224132) which quotes ARM docs: *In Cortex-M processors, the local exclusive access monitor clears automatically on an exception boundary, so exception handlers using CLREX are optional.* - The answer says it applies to interrupt handlers as well, so I assumed it was correct that ARM is using "exception" to cover any kind of interrupt. Are you saying it only applies to synchronous exceptions triggered by running code, not external interrupts like a timer? — Peter Cordes, Jan 10 '21 at 01:58
@PeterCordes No, I am referring to a Cortex-A. Cortex-M do not cache normally. The Cortex-M is used in single-CPU configurations. See: [Cortex-M multi-core design](https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-1989-00-00-00-00-52-92/Multi_2D00_core-microcontroller-design-with-Cortex_2D00_M-processors-and-Cor.pdf). The cortex-M are not designed for SMP data sharing like a Cortex-A. I don't see where the OP is stating their system. I think my comment is helpful for understanding LDREX/STREX on a Cortex-A (via your mention of `clrex`). — artless noise, Jan 10 '21 at 14:17
@artlessnoise: Ok, you were adding to / amplifying my comment for other cores, not correcting it for the Cortex-M case where I said CLREX isn't needed. (Because it still *would* be needed if interrupts didn't effectively do CLREX for you, because you need STREX to fail if you context-switch away and back between ldrex and strex. It must fail due to other threads maybe having run on the same core, which is a problem even without cache.) — Peter Cordes, Jan 10 '21 at 14:32
@PeterCordes Care to explain the Cortex-R? :-) I think there are about 3-10 commercially available. It supports an 'MMU' like structure, but is 'Real-time'. The Cortex-M is quite different than the Cortex-A and 'bus' structure is pretty important in this question. I think it is possible that a FIQ could not need a `clrex` in a well defined system. So some interrupts may not need a `clrex`, if you can guarantee they will not use a *reserve granule*. Probably difficult to guarantee, so the Cortex-M does it automatically. — artless noise, Jan 10 '21 at 14:54
@artlessnoise: I don't know anything about Cortex-R, and not enough about ARM in general, xD. — Peter Cordes, Jan 10 '21 at 14:57

score 2 · Answer 1 · edited Jan 09 '21 at 21:31

ldrex/strex work because logic in the memory system keeps track of exclusive accesses relative to a process id; both the exclusive attribute and the id are presented on the bus with the access.

so if there is an access between the ldrex and strex

ldrex process x
strex process x

due to an interrupt or anything else, the logic is supposed to return a not-okay response and the strex returns:

1 If the operation fails to update memory.

as documented.

Now the gray area here is multi-fold. The ARM logic itself (the caches made by ARM: the L1, and the L2 if you bought one from them) will support exclusive access. At one time the ARM documentation said, and it may still be there, that if this is a uniprocessor (only one core implemented) you do not have to support exclusive access. And you may find that the non-support simply returns an EXOKAY instead of OKAY on the bus (success vs fail) instead of actually keeping track. But to see that, the access has to get past the layers of caching, which means they are off, which pretty much means you are not running an OS, as it is a pain to disable or not enable the cache.

The hardware folks are/were told that you do not have to support exclusive access for uniprocessors, and the general population was told that ldrex/strex are NOT a replacement for SWP (which is still present in a number of cores). ldrex/strex are specifically for multiple cores to share resources; basically they let the different cores talk to each other. They are not for one core to compete with itself.

The software folks were told in places that they are a replacement for SWP. Also you have the problem of the process id: on a uniprocessor, do you have different ids on these transactions? If so, how and when did you set those ids? Even if the hardware is implemented to properly support exclusive access, and is multi-processor, if your two threads share the same id, or all the threads on that core share the same id, then they will interfere with each other. This should be trivial to test with an experiment, though.

The software community, Linux in particular, is focused on it being a replacement for SWP, which made things hard for the one or few vendors that read the "you do not have to support it" guidance, since that made Linux not work. At the same time there are a disturbing number of bugs in the Linux kernel related to ARM in particular; it takes a lot of work to port each new release because so many errata and other workarounds are improperly done. And I suspect many people porting Linux are not aware of the bugs they are creating and/or leaving in their ports.

In short, the theory is that each thread has its own process id, and the logic keeps track of accesses to the addresses in question along with the process id. If there is an access in between one process's ldrex and strex, then the strex will fail and you have to start over with another ldrex. This is why it is in a loop.
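For comparison, here is a minimal sketch of the same retry loop written with C11 atomics rather than raw ldrex/strex. The function name `atomic_increment` is made up for illustration. As far as I know, GCC and Clang targeting ARMv7 generally turn this kind of loop into ldrex/strex, and the weak compare-exchange is allowed to fail spuriously, which is the same "start over" behavior described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically add 1 to *p and return the value stored. */
uint32_t atomic_increment(_Atomic uint32_t *p)
{
    assert(p != NULL);

    /* Plays the role of the ldrex: read the current value. */
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    /* Plays the role of the strex: store old + 1 only if *p still holds old.
     * On failure the call writes the current value of *p back into old,
     * so every retry starts from fresh data. */
    while (!atomic_compare_exchange_weak_explicit(p, &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* Someone else got in between (or the failure was spurious): retry. */
    }
    return old + 1;
}
```

Relaxed ordering is used only to keep the sketch about atomicity of the update itself; code that publishes data to other threads would need stronger ordering.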

so

ldrex id x
...
strex id x  (passes)


ldrex id x
...
ldrex id y
...
strex id x (pass)
...
strex id y (fail)


ldrex id x
... 
ldrex id y
...
strex id y (pass)
...
strex id x (fail)

and so on.
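The sequences above can be replayed against a toy software model of a monitor. This is only a sketch under my own assumptions: `monitor_t`, `MAX_IDS`, `ldrex`, `strex` and `replay_interleaved` are invented names, and real hardware tracks reservation granules rather than exact addresses. It just illustrates the bookkeeping: a successful exclusive store cancels every outstanding reservation on that address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_IDS 4   /* hypothetical number of ids the toy monitor tracks */

/* Toy monitor: one reservation slot per id. Not a model of any real core. */
typedef struct {
    bool      valid[MAX_IDS];
    uintptr_t addr[MAX_IDS];
} monitor_t;

/* ldrex: always reads, and tags the address for this id. */
uint32_t ldrex(monitor_t *m, int id, const uint32_t *p)
{
    assert(id >= 0 && id < MAX_IDS);
    m->valid[id] = true;
    m->addr[id]  = (uintptr_t)p;
    return *p;
}

/* strex: returns 0 and stores if this id still holds a reservation on p,
 * returns 1 and leaves memory untouched otherwise. */
int strex(monitor_t *m, int id, uint32_t *p, uint32_t value)
{
    assert(id >= 0 && id < MAX_IDS);
    if (!m->valid[id] || m->addr[id] != (uintptr_t)p) {
        m->valid[id] = false;
        return 1;
    }
    *p = value;
    /* A successful store cancels all reservations on this address,
     * including the ones held by other ids. */
    for (int i = 0; i < MAX_IDS; i++)
        if (m->valid[i] && m->addr[i] == (uintptr_t)p)
            m->valid[i] = false;
    return 0;
}

/* Replays: ldrex id x, ldrex id y, strex id x, strex id y. */
void replay_interleaved(int *status_x, int *status_y, uint32_t *final_value)
{
    monitor_t m = {0};
    uint32_t counter = 100;
    enum { X = 0, Y = 1 };

    uint32_t seen_x = ldrex(&m, X, &counter);
    uint32_t seen_y = ldrex(&m, Y, &counter);
    *status_x = strex(&m, X, &counter, seen_x + 1);  /* passes */
    *status_y = strex(&m, Y, &counter, seen_y + 1);  /* fails, y must retry */
    *final_value = counter;
}
```

In this model id y's stale 101 never reaches memory; y has to go back to its ldrex, where it would read 101 and then store 102.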

Obviously the logic cannot store history for an infinite number of addresses and process ids, so naturally if the ...

ldrex id x
...
strex id x

has a ton of accesses in between, then you can expect a failure from time to time.

Also note that I think one or more of the Cortex-Ms do not support ldrex/strex in the ARM logic.

Well, okay, there is this language for example:

The Cortex-M3 processor implements a local exclusive monitor. The local monitor within the processor has been constructed so that it does not hold any physical address, but instead treats any access as matching the address of the previous LDREX. This means that the implemented exclusives reservation granule is the entire memory address range.

Which I also see in other cortex-ms.
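A monitor with no address can be modeled as a single flag. The sketch below is my own toy model, not ARM's implementation; all the `m3_*` names and `preempted_increment` are hypothetical. It assumes, as the documentation quoted in the comments under the question says for Cortex-M, that the flag is cleared on an exception boundary, and replays the scenario from those comments: thread A reads 100, is preempted, thread B stores 101, then A resumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy local monitor: no address at all, only "an exclusive access is open". */
static bool exclusive_open;

uint32_t m3_ldrex(const uint32_t *p)
{
    assert(p != NULL);
    exclusive_open = true;
    return *p;
}

/* Returns 0 on success, 1 if the reservation was lost (nothing is stored). */
int m3_strex(uint32_t *p, uint32_t value)
{
    assert(p != NULL);
    if (!exclusive_open)
        return 1;
    exclusive_open = false;
    *p = value;
    return 0;
}

/* Models CLREX, or the automatic clear on an exception boundary. */
void m3_clear_monitor(void)
{
    exclusive_open = false;
}

/* Increment with the usual retry loop. */
void m3_increment(uint32_t *p)
{
    uint32_t old;
    do {
        old = m3_ldrex(p);
    } while (m3_strex(p, old + 1) != 0);
}

/* Thread A loads 100, gets preempted, thread B increments, A resumes. */
uint32_t preempted_increment(int *a_first_status)
{
    uint32_t counter = 100;
    exclusive_open = false;

    uint32_t a_seen = m3_ldrex(&counter);   /* A reads 100 */
    m3_clear_monitor();                      /* timer interrupt, switch to B */
    m3_increment(&counter);                  /* B: 100 -> 101 */
    m3_clear_monitor();                      /* switch back to A */

    *a_first_status = m3_strex(&counter, a_seen + 1);  /* fails, 101 stays */
    if (*a_first_status != 0)
        m3_increment(&counter);              /* A retries: 101 -> 102 */
    return counter;
}
```

A's first strex fails in this model, so the stale 101 is never written and the counter ends at 102, not 101.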

more text from the documentation to ponder.

The Load-Exclusive instruction always successfully reads a value from memory address x.

The corresponding Store-Exclusive instruction succeeds in writing back to memory address x only if no other processor or process has performed a more recent store of address x. The Store-Exclusive operation returns a status bit that indicates whether the memory write succeeded.

For memory regions that do not have the shareable attribute, the exclusive access instructions rely on a local monitor that tags any address from which the processor executes a Load-Exclusive. Any non-aborted attempt by the same processor to use a Store-Exclusive to modify any address is guaranteed to clear the tag.

Notice how processor and/or process is used and not a term like thread. Also note that the comment is about Store-Exclusive and not stores in general. So while experimenting you should also try:

ldrex
...
str
...
strex

and see what happens.
