
I read this in the en.cppreference.com documentation about relaxed operations on atomics:

"[...]only guarantee atomicity and modification order consistency."

So I was asking myself whether such a 'modification order' holds only within a single atomic variable, or also across different ones.

In my code I have an atomic tree, where a low-priority, event-based message thread stores some data in the node that should be updated (red operation '1' in the picture), using memory_order_relaxed. It then does a fetch_or on the node's parent to record which child atomic has been updated. Each atomic holds up to 64 bits, so I set bit 1 in red operation '2'. This continues successively up to the root atomic, which is also flagged with fetch_or, but this time using memory_order_release.

[picture: the atomic tree, with the update steps marked in red]

Then a fast, real-time, unblockable thread loads the control atomic (with memory_order_acquire) and reads which bits are set. It then recursively loads the flagged child atomics with memory_order_relaxed. That is how I sync my data on each cycle of the high-priority thread.

Since this thread is the one consuming the updates, it is fine if a child atomic becomes visible before its parent. The problem is when a parent store (setting the bit of the child to update) becomes visible before I have filled in the child's information.

In other words, as the title says: can the relaxed stores be reordered among themselves before the release one? I don't mind if non-atomic variables are reordered. Pseudo-code; suppose x, y, z and control are atomic, all with initial value 0:

Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed
control = 0; // release

Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed

I wonder whether, in the real-time thread, x <= y <= z would always hold. To check that, I wrote this small program:

#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>

using namespace std;
constexpr int numTries = 10000;
constexpr int arraySize = 10000;
array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};

void writeArray()
{
    // Stores atomics in reverse order
    for (int j=0; j!=numTries; ++j)
    {
        for (int i=arraySize-1; i>=0; --i)
        {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(0, memory_order_release);
    }
}

void readArray()
{
    // Loads atomics in normal order
    for (int j=0; j!=numTries; ++j)
    {
        bool readFail = false;
        tsync.load(memory_order_acquire);

        int minValue = 0;
        for (int i=0; i!=arraySize; ++i)
        {
            int newValue = tat[i].load(memory_order_relaxed);
            // If it fails, it stops the execution
            if (newValue < minValue)
            {
                readFail = true;
                cout << "fail " << endl;
                break;
            }
            minValue = newValue;
        }

        if (readFail) break;
    }
}


int main()
{
    for (int i=0; i!=arraySize; ++i)
    {
        tat[i].store(0);
    }

    thread b(readArray);
    thread a(writeArray);

    a.join();
    b.join();
}

How it works: there is an array of atomics. One thread stores to them with relaxed ordering, in reverse order, and finishes by storing to a control atomic with release ordering.

The other thread loads that control atomic with acquire ordering, then loads the rest of the array's values with relaxed ordering. Since the parents must not be updated before the children, newValue should always be greater than or equal to minValue.

I've run this program on my computer several times, in debug and release builds, and it never triggers the failure. I'm using an ordinary x64 Intel i7 processor.

So, is it safe to assume that relaxed stores to multiple atomics keep the 'modification order', at least when they are synchronized with a control atomic via acquire/release?

  • The quote about relaxed giving *modification order consistency.* only means that all threads can agree on a modification order for *that* one object. i.e. an order exists. – Peter Cordes May 01 '20 at 10:14

2 Answers


Sadly, you will learn very little about what the Standard supports by experimenting on x86_64, because x86_64 is so well-behaved. In particular, unless you specify _seq_cst:

  • all reads are effectively _acquire

  • all writes are effectively _release

unless they cross a cache-line boundary. And:

  • all read-modify-write are effectively seq_cst

Except that the compiler is (also) allowed to re-order _relaxed operations.

You mention using a _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that it is no less expensive than seq_cst: it requires a LOCK-prefixed instruction, with the full overhead that carries.


But, yes _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered with respect to other _relaxed atomic operations as well as non-atomic ones -- by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]

And, yes where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal Y knows it has synchronized and can read what was written by X (up to the release).

Now, the key thing to understand here is that simply doing a store _release is not enough: the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. Otherwise, how can the load tell?

Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.

Your little test program:

  1. initialises tsync to 0

  2. in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)

    so the value of tsync does not change!

  3. in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed)

    and ignores the value read from tsync

I am here to tell you that the _release/_acquire pairs are not synchronizing -- all these stores/loads might as well be _relaxed. [I think your test will "pass" if the writer manages to stay ahead of the reader, because on x86-64 all writes are done in instruction order, as are all reads.]

For this to be a test of _release/_acquire semantics, I suggest:

  1. initialise tsync to 0 and tat[] to all zeros.

  2. in the writer: run j = 1..numTries

    after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)

    this signals that the pass is complete, and that all tat[] is now j.

  3. in the reader: do j = tsync.load(memory_order_acquire)

    a pass across tat[] should find j <= tat[i].load(memory_order_relaxed)

    and after the pass, j == numTries signals that the writer has finished.

where the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which tat[] are written.

If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal same -- then you need another signal and you need the threads to wait for their respective "you may proceed" signal.

Chris Hall
  • Your answer makes it sound like compile-time ordering can't happen when compiling for x86. It can; it just often doesn't, and the scope is often limited by the structure of the source. So yes, agreed with your point about testing on x86 telling you very little, but compiling for x86 doesn't make it truly safe to use `relaxed` where you should have used `release`. https://preshing.com/20120625/memory-ordering-at-compile-time/ – Peter Cordes May 01 '20 at 12:55
  • @PeterCordes: edited. It would be stupid to write *_relaxed* where *_acquire*/*_release* is required, even if one were writing exclusively for x86_64, and I would not want to lead anybody down that primrose path! In C (with which I am more familiar) the `_explicit()` loads, stores etc. have a `volatile`-qualified atomic* parameter, which I read to mean that the operation should "be evaluated strictly according to the rules of the abstract machine". – Chris Hall May 01 '20 at 13:58
  • Yup, good edit. Re: `volatile _Atomic *` in the C interface: Not exactly. That just means you *can* pass a pointer to a `volatile _Atomic` if you happen to have one. In this case it does *not* mean that the actual atomic operation is a `volatile` access in the ISO C standard if you pass a non-`volatile` pointer. In practice compilers don't optimize atomics for the same reason as in C++ ([Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387)): until the standard defines good rules to control it, it's good QoI not to merge them or whatever. – Peter Cordes May 01 '20 at 14:05
  • I forget if GCC or clang will in practice reorder relaxed loads or stores relative to source order, across other relaxed atomic ops. Certainly across non-atomic accesses, but the implementation might end up treating every atomic as actually `volatile` and be ordered (at compile time) wrt non-atomic `volatile` as well. – Peter Cordes May 01 '20 at 14:08
  • @PeterCordes: I find this in the C18 Standard "*NOTE Many operations are volatile-qualified. The “volatile as device register” semantics have not changed in the standard. This qualification means that volatility is preserved when applying these operations to volatile objects.*" I don't have a C++ Standard, but I did find what looks like the same note in a draft for C++11, but it continues: "*It does not mean that operations on non-volatile objects become volatile.*". Perhaps the C guys thought that the extra sentence ran the risk of being altogether too prescriptive (or clear, even) ? – Chris Hall May 01 '20 at 15:14
  • IDK why the difference. If anything C needs that "does not mean" clarification *more*, because the functions straight up take `volatile` pointers. In C++ you have member functions on regular objects. I don't think the clarification wording is "prescriptive"; an implementation is free to make their atomic library function stronger if they want. – Peter Cordes May 01 '20 at 15:21
  • @ChrisHall thanks for your reply! Yes, it's fine that any write after the store/release may be seen by the load/acquire. However something disturbs me: is it necessary to use the return value from load(std::mo_acquire) to synchronize? Could an exchange(std::acq_rel) be used instead, ignoring the return value? In my real program I don't know how to use that value. – Juan JuezSarmiento May 01 '20 at 16:57
  • Fundamental question: when you do the acquire in one thread, how do you know whether the release has or has not happened in the other? Secondary question: if the value written by the release is not necessary for that, why bother with the release? Or, for that matter, the acquire! Incidentally: an exchange(_acq_rel) will of course do the acquire part... but there still needs to be a way of knowing that (or whether) the release has happened in the other thread. – Chris Hall May 01 '20 at 17:36
  • @ChrisHall the release is caused by user input (keys, mouse buttons...); then the next cycle of the fast looped acquire thread would update/sync the needed data as soon as it detects that the root item of the tree (atomic relaxed) changed. I was using the 'release' as something to indicate where the sync points are, i.e. which released variables should be visible in the acquire thread. I think I could use a condition to update instead, something like `if (bool oldv = tsync.exchange(false, acq_rel); oldv != false) {checkAndUpdateRootNode();}`. But is it really needed? (not as easy as I exposed) – Juan JuezSarmiento May 01 '20 at 18:07
  • It looks as though you are here using `tsync.exchange` to pick up and clear a signal that there is work to do. Another thread can `tsync.store(true, mo_release)` to signal that it has completed preparing more work. But I do not understand enough about what you are trying to do to know whether that is sufficient. – Chris Hall May 01 '20 at 18:34
  • It's just reading in a relaxed way all these variables that may be changed by user action. The cyclic thread is continuously looking for flags to update its own internal data. Thank you for your time and effort! :) – Juan JuezSarmiento May 01 '20 at 18:37

The quote about relaxed giving *modification order consistency* only means that all threads can agree on a modification order for *that* one object, i.e. an order exists. A later release-store that synchronizes with an acquire-load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.

Any time you're storing a pointer that other threads could load and deref, use at least mo_release if any of the pointed-to data has also been recently modified, if it's necessary that readers also see those updates. (This includes anything indirectly reachable, like levels of your tree.)

On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be in newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)


On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release so the compiler only needs to block compile time reordering to implement var.store(val, mo_release); It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.

It also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as load or store for weaker orders, though, even acq_rel being slightly cheaper on PowerPC.)

Peter Cordes
  • Caveat: I only skimmed the question and answered based on what I thought you were going to ask based on the diagram. I ended up writing a longer answer than makes sense for that, but let me know if this doesn't actually answer the question. I was going to just comment after skimming, but it's best not to answer in comments ... having ADHD is fun sometimes :P – Peter Cordes May 01 '20 at 10:29
  • TBH, I'm not sure. I know that link, see "Without Fences in Portable C++11" part, this is an extension of that. But without the classic while loop to check a control variable to load everything is below the acquire. Basically what I need to know if the 'fail' of my code would trigger in another computer. Because if it fails it means that in my other complex code it may fail too and leave some nodes without being updated! I know using rel/acq is the easy way to do it, however I'm really in an urge to save as much processing as possible. – Juan JuezSarmiento May 01 '20 at 10:56
  • Those 'relaxed' variables are relaxed because they may be read at any moment; maybe the read thread doesn't get the latest value until the write thread releases, but at least it will be race-free. What I really need is for the reads to observe the stores in the order they were made - I hope that is better explained now - and maybe that forces me to use release/acquire on every node except the last one... but the code doesn't fail! (I'm targeting normal PCs and Macs) – Juan JuezSarmiento May 01 '20 at 11:09