1

As per http://lxr.free-electrons.com/source/arch/arm/include/asm/atomic.h#L31

 static inline void atomic_add(int i, atomic_t *v)
 {
         unsigned long tmp;
         int result;

         prefetchw(&v->counter);
         __asm__ __volatile__("@ atomic_add\n"
 "1:     ldrex   %0, [%3]\n"
 "       add     %0, %0, %4\n"
 "       strex   %1, %0, [%3]\n"
 "       teq     %1, #0\n"
 "       bne     1b"
         : "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
         : "r" (&v->counter), "Ir" (i)
         : "cc");
 }

How can it be called "atomic" when it can be preempted?

Peter Cordes
  • 1
    Do you understand how `ldrex` and `strex` function? I can find some good references if you want to learn. They're the magic that makes this atomic while still being preemptable. Atomic: yes, deterministic: no. – rjp May 19 '14 at 13:51
  • Suppose two threads call the same function: one succeeds and updates the value; after that, the other thread (which was looping) will also succeed and add its value, so the result of the second is not correct. –  May 19 '14 at 14:12
  • It sounds like you actually want a local copy of the memory you are adding to. The addition is atomic, but the value to which you are adding may change before the addition takes place. @gnasher729's answer does address this. – rjp May 19 '14 at 16:42

2 Answers

5

You seem to be totally misunderstanding what an atomic operation is.

It should be obvious that if you look at a value x, see that the value is 13, and then call an atomic_add function that increases it by 5, the new result could be anything, because another thread could have changed the value before atomic_add was called. Likewise, if you check the result afterwards, it could again be anything, because another thread could change the value between the atomic_add and your check.

An atomic_add function guarantees to leave the value increased by exactly that amount. And that's what it does. How this is achieved doesn't matter. If a hundred threads call atomic_add(5, &x), then x will end up increased by 500. That's what matters.

That's the typical way atomic operations are implemented on processors like ARM and PowerPC, which have a load instruction that reserves a memory location and a store instruction that checks whether the reservation still exists.
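The guarantee can be demonstrated with C11's standard `atomic_fetch_add`, which on ARM compiles down to exactly this kind of ldrex/strex loop. A minimal sketch (the thread count, increment, and `run_demo` name are mine, chosen to mirror the "hundred threads" example above):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 100
#define AMOUNT 5

static atomic_int x;

static void *worker(void *arg) {
    (void)arg;
    /* C11 equivalent of the kernel's atomic_add(AMOUNT, &x) */
    atomic_fetch_add(&x, AMOUNT);
    return NULL;
}

/* Runs NTHREADS threads, each adding AMOUNT; returns the final value. */
int run_demo(void) {
    pthread_t t[NTHREADS];
    atomic_store(&x, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&x);
}
```

No intermediate value is predictable, but the final sum is: every one of the 100 additions of 5 lands, so `run_demo()` always returns 500.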

gnasher729
  • Maybe you are right; I do now understand the atomic add. An atomic add operation means, AFAIK, completing in one cycle, such that any preemption or modification of the value from another context happens before the addition itself. If an atomic operation takes a lock using another variable, how is it different from doing a simple add in a lock/unlock critical section? –  May 19 '14 at 09:44
  • 4
    Atomic does not mean "in one cycle". And most of the time atomic instructions will take a lot more than a single cycle, because of the bookkeeping overhead required. What it means is that your value gets changed in a way that the thing you want to happen actually happens. If another thread changed the value between the load and the store, your result would normally be corrupted in some way. An atomic operation guarantees that the operation completes as a block. With a critical section you just shift the atomic part into the lock instruction, so in the end you get overhead with no gain. – Nico Erfurth May 19 '14 at 11:35
0

The ldrex and strex instructions take a different tactic for implementing atomic operations. Traditional compare-and-swap (CAS for short) has limitations in lock-free programming. Load-link/store-conditional (LL/SC for short) is a more advanced form of atomic primitive that allows things like atomic linked lists; compare the two and think about how they might be implemented in silicon.
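As a concrete illustration of the "atomic linked list" idea, here is a sketch of a lock-free stack push built on C11's `atomic_compare_exchange_weak` (the weak variant is allowed to fail spuriously, much like a failed strex). The `node` and `push` names are mine, not from the answer:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

/* Global stack head; all updates go through the CAS loop below. */
static _Atomic(struct node *) head = NULL;

/* Lock-free push: retry until the head we read is still the head we swap. */
void push(int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = atomic_load(&head);
    /* On failure, n->next is refreshed with the current head -- retry. */
    while (!atomic_compare_exchange_weak(&head, &n->next, n))
        ;
}
```

Pop is where the difference shows: a CAS-based pop is exposed to the ABA problem, whereas an ldrex/strex pop is not, because any intervening write to the reserved granule clears the reservation and fails the strex.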

For traditional ARM, the swp and swpb instructions provided the atomic mechanism. This seems obvious, as the instruction stands by itself. The complication is in multi-CPU designs. The CPU running the swp must lock the bus so that other CPUs cannot read or write the memory while the update is being performed. Typically this locks the whole bus, not just the single location, and that whole-bus mechanism means Amdahl's law applies.

As per Masta79, atomic does not mean "in one cycle", and this is a distinguishing feature of the ARM ldrex/strex atomic support. A particular reserve granule is reserved on the ARM bus for the duration of the ldrex/strex pair. This allows many different complex lock-free primitives, since many valid instructions are permitted between the ldrex and the strex. However, it comes with the additional complication that strex reports a retry status via the condition codes. This is a definite shift in mind-set from traditional atomic operations. With separate reservations, each CPU can make progress as long as they are not trying to lock the same reserve granule (a specific chunk of memory) at the same time. This should help avoid the Amdahl road block (i.e., memory bandwidth lost to bus locking).
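This retry-on-failure shape is what C11 exposes portably as `atomic_compare_exchange_weak`, which is permitted to fail spuriously precisely so it can map onto a failed strex. A hedged sketch of atomic add written that way (the `cas_add` name is mine):

```c
#include <stdatomic.h>

/* Same shape as the kernel asm: load, compute, attempt the store, retry. */
int cas_add(atomic_int *v, int i) {
    int old = atomic_load(v);
    /* The weak CAS may fail spuriously, just as strex may -- loop until it commits. */
    while (!atomic_compare_exchange_weak(v, &old, old + i))
        ;  /* on failure, old is reloaded with the current value */
    return old + i;  /* the value this call installed */
}
```

On ARM a compiler can turn this whole loop into a single ldrex/add/strex/bne sequence, so the C-level retry and the hardware-level retry collapse into one.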

How can it be called "atomic" when it can be preempted?

An important feature of the code is prefetchw(&v->counter);, which brings the value into the cache. The cache acts as a temporary buffer that a successful strex commits: only the cached value is modified until the strex, and if the strex finds the reservation has been lost (another CPU committed first), the value is thrown away. An interrupt handler on the same CPU does a clrex, which likewise clears the reservation so the strex fails and the loop retries.

Peter Cordes
artless noise
  • Here is [an EDN article](http://edn.com/electronics-news/4368122/ARM-launches-AMBA-AXI-coherent-extensions) to help explain why the *reserve granules* are better. – artless noise May 22 '14 at 22:08
  • While the ldrex/strex combination in some cases avoids the ABA problem, which can cause trouble in some compare-and-store scenarios, the fact that operations are allowed to spontaneously fail means that, in practice, it's often necessary to use ldrex/strex as building blocks for CAS or something similar, and then build more complicated algorithms upon that. It's possible to use ldrex/strex for things that are slightly trickier than CAS, which could add resistance to ABA scenarios, but ldrex/strex are generally not as powerful as they might seem. – supercat Feb 05 '15 at 19:52
  • @supercat Mostly true. *...generally not as powerful...*. That is to think of them as CAS; they are different. [Practical lock-freedom by Keir Fraser](http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf) has a good overview, and I think LDREX/STREX are more like STM (software transactional memory, sec 3.3). So I don't think *not as powerful* applies. It is just that people are much more used to CAS, as it has been around longer (M68k, etc). People think ldrex/strex are like ARM swp; they are quite different. – artless noise Feb 05 '15 at 20:08
  • A problem with ldrex/strex is that from what I understand there's no guarantee as to how much code can separate them without causing an excessive probability of strex failure. On some platforms one can probably get away with a lot, but without some concrete assurance that an ldrex/strex will succeed within two tries unless some other thread performs a successful store to the memory location in question, I don't see how one could avoid live-lock and thrashing. – supercat Feb 05 '15 at 20:16
  • You need to structure your algorithm correctly (like the paper I referenced goes into detail about). For instance, you can use the [spinlock](http://lwn.net/Articles/163215/) mechanics. Use a counter to see if you can enter the 'ldrex/strex' sequence. The counter itself is also 'ldrex/strex', but will only block for the amount of lockers. New entries wait to contend in the next round. – artless noise Feb 05 '15 at 20:43
  • 2
    If the implementation of ldrex/strex were such that an ldrex/strex pair would always succeed unless the location being monitored were externally written, then one could do a lot between the ldrex and strex if other threads were unlikely to write to the guarded location in the interim. Not all processors are that fancy, however, and some processors have to service interrupts very frequently. If a system seldom goes more than 300 cycles between interrupts, and the computation between an ldrex and strex takes 320 cycles, it may spin indefinitely. – supercat Feb 05 '15 at 21:49
  • Is it really true that ARM CPUs made `swp` work "in one cycle"? How is that even possible if we're talking about atomicity wrt. DMA or other bus-master devices? Also, efficient atomic RMW operations are possible without locking the whole bus, just hang onto that cache line, delaying response to MESI requests so you keep it in exclusive / modified state. (That's [what modern x86 CPUs do](https://stackoverflow.com/q/39393850) for `xchg` and `lock add`, and I assume what modern ARMv8.1 CPUs do for their single-instruction atomic-RMW operations that scale better to huge numbers of cores.) – Peter Cordes Apr 08 '22 at 03:19
  • @PeterCordes I meant a single instruction, not a cycle or single clock (updated). `swp` will lock the bus for a read/write. As you note this is a main advantage of `ldrex/strex` as for Cortex-A CPUs it will use a cache line. The ARM nomenclature is a little bizarre as for Cortex-M, there is no cache and the 'reserve granule' is the entire address space. You need to have cache active on Cortex-A, but not on Cortex-M to use `ldrex/strex`. – artless noise Apr 08 '22 at 14:19
  • There were never any ARM cores that made `swp` on cacheable memory efficient? Was it removed by the time ARMv8.1 added instructions with similar RMW semantics? No reason it couldn't be, for the same reason that x86 is able to make `xchg` efficient. (Throughput scales linearly with number of threads, as long as each thread is exchanging in a different cache line. The internal load and store + other uops it decodes to presumably lock / unlock a cache line, instead of arming a monitor and doing a conditional store.) Unless `swp` has some extra guarantee of actually bus-locking. – Peter Cordes Apr 08 '22 at 14:26
  • @PeterCordes Well, I was not comparing x86, I was trying to answer the original question of 'how is this atomic if it can be pre-empted?'. The answer is, it uses a cache as a 'scratch area' to perform the computation. – artless noise Apr 08 '22 at 14:36
  • Well, I already understood LL/SC, and just came here looking for possible duplicates for [ldrex and strex atomicity when pre-empted by another thread?](https://stackoverflow.com/q/71791070) that explained exactly when `clrex` was required. But I did upvote this answer after looking it over; I only had a nitpick about `swp`. (And I'd previously read that `swp` was deprecated but hadn't ever found out if it was really slow (ARM CPUs not implementing it efficiently for some reason), or just because it must increase worst-case interrupt latency vs. an LL/SC that can always bail out.) – Peter Cordes Apr 08 '22 at 14:37
  • I was comparing x86 and ARMv8.1 single-instruction RMWs because that's the obvious comparison *for `swp`* (not for ldrex / strex). – Peter Cordes Apr 08 '22 at 14:39
  • Understand that. My gripe is that the other answer does nothing to actually answer the question directly about how it is accomplished. It just says that atomic and pre-emption are not the same thing, not how atomic and pre-emption can work together. – artless noise Apr 08 '22 at 14:48