Partial reordering of C++11 atomics on Aarch64

Question

I was looking at the compiler output of rmw atomics from gcc and noticed something odd - on Aarch64, rmw operations such as fetch_add can be partially reordered with relaxed loads.

On Aarch64, the following code may be generated for value.fetch_add(1, seq_cst)

.L1:
    ldaxr x1, [x0]
    add x1, x1, 1
    stlxr w2, x1, [x0]
    cbnz w2, .L1

However, loads and stores that come before the ldaxr can be reordered after it, and loads and stores that come after the stlxr can be reordered before it (see here). GCC doesn't add fences to prevent this. Here's a small piece of code demonstrating it:

void partial_reorder(std::atomic<int>& loader, std::atomic<int>& adder) {
    loader.load(std::memory_order_relaxed); // can be reordered after the ldaxr
    adder.fetch_add(1, std::memory_order_seq_cst);
    loader.load(std::memory_order_relaxed); // can be reordered before the stlxr
}

generating

partial_reorder(std::atomic<int>&, std::atomic<int>&):
    ldr     w2, [x0]        // can be reordered down
.L2:
    ldaxr   w2, [x1]
    add     w2, w2, 1
    stlxr   w3, w2, [x1]
    cbnz    w3, .L2
    ldr     w0, [x0]        // can be reordered up
    ret

In effect, the loads can be partially reordered with the RMW operation - they occur in the middle of it.

So, what's the big deal? What am I asking?

It seems strange that an atomic operation is divisible as such. I couldn't find anything in the standard preventing this, but I had believed that there was a combination of rules that implied operations are indivisible.
It seems like this doesn't respect acquire ordering. If I perform a load directly after this operation, I could see store-load or store-store reordering between the fetch_add and the later operation, meaning that the later memory access is at least partially reordered behind the acquire operation. Again, I couldn't find anything in the standards explicitly saying that isn't allowed and acquire is load ordering, but my understanding was that the acquire operation applied to the entirety of the operation and not just parts of it. A similar scenario can apply to release where something is reordered past the ldaxr.
This one may be stretching the ordering definitions a bit more, but it seems invalid that two operations before and after a seq_cst operation can be reordered past each other. This could(?) happen if the bordering operations each reorder into the middle of the operation, and then go past each other.
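To make the concern concrete, here is a minimal litmus-test sketch of the store-then-load pattern in question. The names `run_litmus_once`, `x`, and `y` are hypothetical and not from any real codebase. As far as I can tell, the C++ memory model permits both relaxed loads to return 0 here, because nothing orders a relaxed load after the preceding seq_cst RMW on a different variable. A single run like this is very unlikely to actually show that outcome, so treat it as an illustration of the pattern rather than a reproducer.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical litmus test: each thread does a seq_cst RMW on its own
// variable, then a relaxed load of the other thread's variable.
// Returns {r1, r2}, the values the two relaxed loads observed.
std::pair<int, int> run_litmus_once() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.fetch_add(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed); // may move before the store half of the RMW
    });
    std::thread t2([&] {
        y.fetch_add(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed); // may move before the store half of the RMW
    });

    t1.join();
    t2.join();
    return {r1, r2};
}
```

If both loads were seq_cst instead of relaxed, the outcome r1 == 0 and r2 == 0 would be ruled out, since all four operations would then take part in the single total order.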

Possibly related: http://bigflake.com/seq_cst.cpp shows compiler output for x64/arm32/aarch64 for basic atomic load/store. `store(val, std::memory_order_seq_cst)` appears to be missing a barrier. This is with gcc 4.9 on Android. — fadden, Jul 05 '16 at 22:41
@fadden: It's actually fine, see https://stackoverflow.com/questions/65466840/arm-stlr-memory-ordering-semantics/65473798#65473798. The AArch64 `ldar/stlr` instructions actually have stronger ordering semantics than their names suggest. Specifically, an earlier `stlr` is guaranteed not to be reordered after a later `ldar`, even though ordinary acquire/release semantics would allow it. This means that they can be used as `seq_cst` without further barriers. Later ARMv8 versions add `ldapr` which does allow such reordering and gets you back to classic acquire/release. — Nate Eldredge, Jul 12 '23 at 06:18

Tsyvarev · Answer 1 · 2016-07-06T06:47:19.213

6

Looks like you are right. At least, a very similar bug for gcc has been accepted and fixed.

They provide this code:

.L2:
    ldaxr   w1, [x0]       ; load-acquire (__sync_fetch_and_add)
    add w1, w1, 1
    stlxr   w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
    cbnz    w2, .L2

So previous operations can be reordered with the ldaxr and further operations can be reordered with the stlxr, which breaks C++11 conformance. The documentation for barriers on AArch64 clearly explains that such reordering is possible.

edited Jul 06 '16 at 06:47

answered Feb 05 '16 at 07:14

Tsyvarev

While I can't come up with a scenario where the first load-load reordering matters, the second store-load reordering may matter - epoch- or hazard-pointer-like techniques often require store-load ordering between marking the thread as active and loading a pointer. In this case, if the second load was a pointer, it could load the pointer before committing the store as 'active'. This is valid if the acquire and release in acq_rel only apply to parts of an RMW operation. That seems like an uncharitable detail of the standard as it means atomics are visibly non-atomic, in a sense. – Sam S Feb 05 '16 at 14:47
On other weakly ordered architectures, acquire-release ordering is achieved with explicit fences (dmb around ldrex and strex for ARM), which prevents this sort of internal reordering. Also, I still don't see how allowing operations to reorder around a seq_cst operation, which is stronger than acq_rel, is valid. – Sam S Feb 05 '16 at 14:49
`it could load the pointer before committing the store as 'active'.` - Again, "committing" is done by the `stlxr` instruction, which cannot be reordered with further loads. `cbnz` just checks that the commit has succeeded; its behavior isn't observable by other threads. As for `seq_cst` ordering, I believe that the `stlxr` instruction also provides a global visibility guarantee on the given architecture. It could be another question for Stack Overflow though. – Tsyvarev Feb 05 '16 at 17:44
Actually, stlxr can be reordered with further loads and stores - I can't get links to the ARM docs to work properly, but they are very explicit about this. stlxr only has store-release ordering, which means that later load/store operations can be reordered before it. If stlxr provided store-load and store-store ordering, then this question would be moot. – Sam S Feb 05 '16 at 18:46
I think the doc you want is: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CHDCJBGA.html (explains LDAR/STLR). – fadden Jul 05 '16 at 22:58
@fadden: Thanks for the link, I have added it into the answer. – Tsyvarev Jul 06 '16 at 06:47

score 1 · Answer 2 · answered Jul 12 '23 at 06:13

I asked the same question in "For purposes of ordering, is atomic read-modify-write one operation or two?", not knowing that it was a duplicate.

You're right that this means another load or store can be reordered "into the middle" of an atomic RMW. I don't think this is a bug, though.

Since nearly all of the C++ memory model is defined in terms of loads and stores, I believe (others may disagree) that we must treat an atomic read-modify-write as a pair consisting of one load and one store. Its "atomic" nature comes from [atomics.order p10] (in C++20), that the load must see the value that immediately precedes, in the modification order, the value written by the store.

Effectively, this means that no other stores to adder itself can occur between the read and the write of the RMW. But accesses to other variables, such as loader, are fair game, limited only by the barriers. Acquire ordering doesn't forbid the load of the RMW from being reordered with an earlier relaxed operation, so such reordering is legitimate.
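As a small illustration of what that atomicity guarantee does buy you, here is a sketch in which several threads increment one counter with relaxed RMWs. The function name `count_with_relaxed_rmw` and its parameters are made up for this example. No increment is lost even with the weakest ordering, because each fetch_add must read the value immediately preceding its own write in the counter's modification order.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: spawn num_threads workers that each perform
// increments_per_thread relaxed fetch_add operations on a shared counter.
int count_with_relaxed_rmw(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Relaxed is enough for the count itself: RMW atomicity
                // does not depend on the memory order argument.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join(); // join synchronizes, so the final load sees every increment
    }
    return counter.load(std::memory_order_relaxed);
}
```

The memory order only controls how the RMW is ordered relative to accesses to other variables, which is exactly the part the question is about.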

If your code needs to avoid such reordering, then you have to strengthen your barriers: the first loader.load() needs to be acquire or stronger, and the second one needs to be seq_cst.
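A sketch of that strengthened version, adapted from the function in the question, might look like the following. The name `ordered_reads` and the return value are my additions for illustration only. On AArch64 I would expect both loads to compile to ldar rather than plain ldr, though I have not checked the output of every compiler version.

```cpp
#include <atomic>

// Hypothetical variant of partial_reorder with stronger orderings on the loads.
int ordered_reads(std::atomic<int>& loader, std::atomic<int>& adder) {
    // Acquire: later accesses, including both halves of the RMW,
    // cannot be reordered before this load.
    int before = loader.load(std::memory_order_acquire);

    adder.fetch_add(1, std::memory_order_seq_cst);

    // seq_cst: this load takes part in the single total order with the
    // seq_cst RMW above, so it cannot appear to precede the RMW's store.
    int after = loader.load(std::memory_order_seq_cst);

    return before + after;
}
```

Called with loader holding 5 and adder holding 0 on a single thread, this returns 10 and leaves adder at 1.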

if your question is a duplicate, you should close it as one and then mod-flag for a merge to happen if there are good answers in the dup. — starball, Jul 12 '23 at 08:21
