21

I've been working on an embedded OS for ARM, However there are a few things i didn't understand about the architecture even after referring to ARMARM and linux source.

Atomic operations.

ARM ARM says that Load and Store instructions are atomic and it's execution is guaranteed to be complete before interrupt handler executes. Verified by looking at

arch/arm/include/asm/atomic.h :
    #define atomic_read(v)  (*(volatile int *)&(v)->counter)
    #define atomic_set(v,i) (((v)->counter) = (i))

However, the problem comes in when i want to manipulate this value atomically using the cpu instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc..) which use LDREX and STREX for ARMv7 (my target).

ARMARM doesn't say anything about interrupts being blocked in this section so i assume an interrupt can occur in between the LDREX and STREX. The thing it does mention is about locking the memory bus which i guess is only helpful for MP systems where there can be more CPUs trying to access same location at same time. But for UP (and possibly MP), If a timer interrupt (or IPI for SMP) fires in this small window of LDREX and STREX, Exception handler executes possibly changes cpu context and returns to the new task, however the shocking part comes in now, it executes 'CLREX' and hence removing any exclusive lock held by previous thread. So how better is using LDREX and STREX than LDR and STR for atomicity on a UP system ?

I did read something about an Exclusive lock monitor, so I've a possible theory that when the thread resumes and executes the STREX, the os monitor causes this call to fail which can be detected and the loop can be re-executed using the new value in the process (branch back to LDREX), Am i right here ?

sgupta
  • 1,214
  • 2
  • 15
  • 29

2 Answers2

19

The idea behind the load-linked/store-exclusive paradigm is that if if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:

do
{
  new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));

can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:

do
{
  old_value = *dest;

  new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);

... Assuming CompareAndStore is something like:

uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint_32 old_value)
{
  do
  {
    if (__LDREXW(dest) != old_value) return 1; // Failure
  } while(__STREXW(new_value, dest);
  return 0;
}

This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and __STREXW]

Addendum An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively a references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 255), one could use something like:

#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
  // The foo_pool_alloc() method should return a slot number in the lower bits and
  // some sort of counter value in the upper bits so that once some particular
  // uint32_t value is returned, that same value will not be returned again unless
  // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
  // the possibility that while one task is performing its update, a second task
  // changes the thing to a new one and releases the old one, and a third task gets
  // given the newly-freed item and changes the thing to that, such that from the
  // point of view of the first task, the thing never changed.)

  uint32_t new_thing = foo_pool_alloc();
  uint32_t old_thing;
  do
  {
    // Capture old reference
    old_thing = foo_current_thing;

    // Compute new thing based on old one
    update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
      &foo_pool[old_thing & FOO_POOL_SIZE_MASK);
  } while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
  foo_pool_free(old_thing);
}

If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).

supercat
  • 77,689
  • 9
  • 166
  • 211
  • I've been doing exactly the same thing but the part where significant computing is required for the new value, puzzles me still. Using cmxchg loop makes sense because then the exclusive monitor won't be cleared by a context switch but re-doing the significant computing requires a lot of overhead as I've observed street to fail with no apparent reasons (UP with IRQs masked in PSR) as mentioned in your post. – sgupta Mar 13 '13 at 01:59
  • @user1075375: See addendum – supercat Mar 13 '13 at 15:48
  • These (__LDREXW & __STREXW) are intrinsics supported in the Keil compilers for Cortex-M series microcontroler-level processors, not generally available for mainstream ARM targets (e.g., AArch64) and compilers (e.g., gcc, llvm) right? http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0552a/BABDEEJC.html – ahcox Feb 05 '15 at 19:20
  • @ahcox: ARM recommends that all compiler vendors targeting the Cortex-M3 series support those intrinsics, but can't force them; I would expect that GCC should support them, but I don't really know. As for other processors, I know the Cortex-M0 does not support those operations, but I would expect more advanced processors to do so. Generally I would suggest that one confine use of them to small methods like "atomic increment" and such, which could easily be rewritten if needed to use other approaches (e.g. on the Cortex-M0, they'd be implemented by temporarily disabling interrupts). – supercat Feb 05 '15 at 19:42
  • 1
    Thanks @supercat. Just now looking at some AArch64 disassembly and see that the GCC builtin `__sync_add_and_fetch(address, 1)` generates optimal code using these ops. Like this: `top:` `ldaxr w1, [x0]` `add w1, w1, #0x1` `stlxr w2, w1, [x0]` `cbnz w2, top` (sorry for messy formatting). – ahcox Feb 05 '15 at 19:51
12

Okay, got the answer from their website.

If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.

sgupta
  • 1,214
  • 2
  • 15
  • 29
  • 1
    This is safe as long as your OS uses `clrex` or a dummy `strex` inside the context-switch, or itself uses a LDREX / STREX before returning to user-space. Otherwise you could get a false *positive* if you switched from one LL/SC retry loop to the middle of another. The CPU could see the LDREX in one loop and the STREX in the other and if no invalidations came in between them then it could succeed on simplistic implementations. – Peter Cordes Jan 10 '19 at 13:48
  • Great answer, although the link is not really helpful (anymore). That specific page has probably been moved. Maybe [this link](https://developer.arm.com/documentation/dui0646/a/The-Cortex-M7-Processor/Memory-model/Synchronization-primitives?lang=en) is more useful? It seems to contain the relevant information. – wovano Jul 28 '22 at 09:06
  • @PeterCordes, can you elaborate on that? The documentation states that the exclusive access tag is removed if an exception occurs. So (since task switches are generally implemented using exceptions), that should only result in false negatives, not false positives, right? – wovano Jul 28 '22 at 09:08
  • 1
    @wovano: [When is CLREX actually needed on ARM Cortex M7?](https://stackoverflow.com/q/51162344) says it's potentially necessary on Cortex-M. I haven't read the ARM manuals myself; maybe it's different on some models. Or maybe "exception" means synchronous, like a page fault, not including external interrupts like a timer interrupt. I don't know the precise meaning of the terminology in an ARM context. – Peter Cordes Jul 28 '22 at 16:55
  • @PeterCordes, thanks for the link, that's interesting. Apparently there is a lot of confusion about these instructions. And there might indeed be differences between the different ARM architectures. For Cortex-M7, I'm pretty sure that interrupts (as well as PendSV, which is intended for context-switching) are all [types of exceptions](https://developer.arm.com/documentation/dui0646/c/The-Cortex-M7-Processor/Exception-model/Exception-types?lang=en), so I'd say the LDREX/STREX pattern is guaranteed to work correctly on that architecture. – wovano Jul 28 '22 at 18:06