
If I have 2 threads that execute f() and f2(), like this:

std::atomic_flag flag; // a member variable in a class
std::mutex m;

// flag.test_and_set() is called (setting the flag to true) before f() or f2() runs

// executed only once (a "stop" function)
void f() {
    flag.clear(std::memory_order_relaxed);

    {
        std::unique_lock<std::mutex> lock(m);
        // do something 
    }
    // join f2's thread
}

// long running function
void f2() {
    std::unique_lock<std::mutex> lock(m);

    while (flag.test(std::memory_order_relaxed)) {
        // do stuff
    }
}

f2() is generally executed first, and f() is supposed to stop f2().

It's important that flag.clear() executes before the mutex is acquired; otherwise f() wouldn't return (in my actual case, it would just take a much longer time to finish).

So my question: Is relaxed memory ordering sufficient to guarantee that flag.clear() executes before locking the mutex?

In this question, a similar scenario is presented. From what I can understand, it's safe to use a relaxed memory order there because there are no dependencies on the flag other than joining the thread, and std::thread's join is a synchronizing operation. Therefore flag.clear() couldn't be moved after join.

However, here I use a std::mutex. If I were to switch to another implementation like this one: https://rigtorp.se/spinlock/, would it still be safe to use a relaxed memory order? Since their lock only uses memory_order_acquire, isn't it technically possible for the locking of the mutex to be ordered entirely before flag.clear()? Acquire only prevents operations after it from being reordered before it; it says nothing about earlier stores moving after it.

Would I need something like a StoreLoad barrier, as described here: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/? This question seems to suggest that I need a full memory barrier as well.

Or alternatively, if the flag.clear() store operation could somehow have an "acquire" memory ordering, then it would prevent operations after it from being reordered before the store. But this comment suggests that you need sequentially consistent ordering instead:

If you really need to prevent something from being reordered before a store, AFAIK your only recourse is to use seq_cst on both the store and the other operation. Nothing weaker suffices in the C++ memory model.

j0011
  • The mutex is a sequence point. While it is being locked, all reads and writes before it are guaranteed to finish regardless of other memory access models. All reads and writes after it are guaranteed not to start. – 273K Jul 12 '23 at 02:31
  • @273K But in the case of the spin-lock mutex, `lock()` is just a bunch of atomic exchange operations using the acquire memory order, so wouldn't that not prevent the reordering of the store above it? – j0011 Jul 12 '23 at 02:52
  • 1
    If I understand correctly, the question is really about whether the visibility of the store could be *delayed* arbitrarily while `f()` is waiting for the lock to become available? Other than that, the ordering doesn't seem relevant. E.g. if `f()` managed to take the lock immediately, but the store became visible a short time later, that doesn't seem to cause any problems for your program. – Nate Eldredge Jul 12 '23 at 03:12
  • 1
    If that's it, then it's really a "forward progress" question that is answered by [intro.progress p18]. Since the store has to become visible in finite time, the compiler would not be allowed to unconditionally move it past an operation like `lock()` which is not guaranteed to ever return. Note that memory ordering has nothing to do with this. – Nate Eldredge Jul 12 '23 at 03:14
  • @NateEldredge "E.g. if f() managed to take the lock immediately" I think in the code it's impossible for `f()` to take the lock immediately, without first making the store, or else `f2()` would still be in the loop. – j0011 Jul 12 '23 at 03:19
  • Would not being able to take the lock immediately affect anything though? This atomicity stuff is making my brain hurt :( – j0011 Jul 12 '23 at 03:29
  • Oh I see, we're assuming that f2() already holds the lock before f() gets called, so having f() attempt to take the lock is an operation that will never complete. Then yes, the store will become visible in finite time regardless. – Nate Eldredge Jul 12 '23 at 03:43
  • I see, thank you. I will clarify it in the question. – j0011 Jul 12 '23 at 03:46
  • 1
    That's not the same thing as requiring the store to become visible before starting to try the lock, though. Imagine a simple implementation where `lock()` is a spin loop that repeatedly loads some atomic object. The implementation would be allowed to, say, perform 100 of those loads, and then make the store visible. But that would cause no harm to your program. – Nate Eldredge Jul 12 '23 at 03:46
  • @NateEldredge That's what I'm trying to avoid though, because in my program acquiring the `lock()` can take an unreasonably long time (many seconds), and in this questions case, `lock()` will never be acquired unless the store happens first. – j0011 Jul 12 '23 at 03:49
  • 1
    In my example, each individual load takes an infinitesimal amount of time. The lock is acquired only when one of those loads returns a certain value, say `false`. So doing 100 loads before making the store visible does *not* take "many seconds", and the compiler is allowed to delay the store until after 100 loads have been done. And if by that time the lock happens to have been acquired, well and good (though in your program, that won't in fact happen). But the compiler isn't allowed to delay the store *until* the lock is acquired. Do you see the difference? – Nate Eldredge Jul 12 '23 at 03:55
  • Ohhhhh, so do you mean that `lock()` will always make the store which _happened before_ visible in a time frame that's not determined by the time it takes to actually acquire the lock (which would be infinite until the store becomes visible)? – j0011 Jul 12 '23 at 03:58
  • 1
    It's not really to do with the `lock()`. Once you have reached the `store` statement, it has to become visible in finite time, regardless of what code follows: be it `lock()` or `recv()` or an infinite loop or what have you. – Nate Eldredge Jul 12 '23 at 04:01
  • I see, and I'm guessing that finite time is not dependent on anything else. But what prevents the compiler from reordering the store until after the `lock()` or `recv()`? Would it be the intro progress p18 clause? – j0011 Jul 12 '23 at 04:03
  • If your code was set up such that the `lock()` *could* eventually complete, then it is possible that some of the `do something` would be visible before the store to `flag`. That's the result of the relaxed/acquire memory ordering. But once again, such behavior is irrelevant to this program. – Nate Eldredge Jul 12 '23 at 04:03
  • I see, thank you. But "If your code was set up such that the lock() could eventually complete", in my code, `lock()` does eventually complete because setting the flag would cause `f2()`'s while loop to finish, eventually releasing the mutex. – j0011 Jul 12 '23 at 04:06
  • Give me a few minutes, I'll try to think of another way to say this. – Nate Eldredge Jul 12 '23 at 04:06
  • Shoot, I made a slight mistake in my question. Flag is set to true before f() or f2() – j0011 Jul 12 '23 at 04:16
  • If you actually need an atomic store to commit before a later op without any other synchronization guaranteeing stuff, either they both need to be `atomic` stores and the second one has to be at least `release`, or they both need to be `atomic` with `seq_cst` if the second is a load. Or you need `atomic_thread_fence(release)` for a StoreStore barrier (and LoadStore) or `seq_cst` for a full barrier including StoreLoad for something that works in practice but isn't fully guaranteed. See [Implementing 64 bit atomic counter with 32 bit atomics](//stackoverflow.com/q/54611003) – Peter Cordes Jul 12 '23 at 04:25
  • @PeterCordes But how would I change std::mutex to have a seq_cst order? If I couldn’t, then how would I utilize an atomic_memory_fence without a later atomic variable because all of the examples of using atomic_memory_fence use an atomic operation after the fence rather than before? (Assuming that, the store is the first op, and the std::mutex stuff is the second “op”) – j0011 Jul 12 '23 at 04:29
  • You can't. It's a mutex, not an `atomic<>` operation, so the last part applies, about using `atomic_thread_fence(release)` or `seq_cst` depending on whether you care about StoreStore or StoreLoad ordering (https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/). Unless you don't need that in this case if some other synchronization ends up guaranteeing what you need. – Peter Cordes Jul 12 '23 at 04:31
  • *all of the examples of using atomic_memory_fence use an atomic operation after the fence rather than before?* - That's why I linked you [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) in my previous comment, for a SeqLock example that uses a `fence` after an atomic store, and before an atomic load, opposite of normal usage, to order a plain or `volatile` access wrt. an `atomic` in a way that works in practice but is UB in ISO C++. – Peter Cordes Jul 12 '23 at 04:33
  • @PeterCordes I see, thank you! But if I’m using std::mutex as the second op, then would that be considered an atomic load? If not, then how would I simulate that? Could I use a dummy atomic? – j0011 Jul 12 '23 at 04:35
  • @j0011: In ISO C++ it's not technically an `atomic` load. It is an RMW operation with `acquire` semantics (although on many implementations it will happen to be stronger, e.g. on x86 any atomic RMW is a full barrier). But there's no way to strengthen it to `seq_cst`, so you're going to need `atomic_thread_fence` and depend on how stuff works in practice for the general case of forcing a store to be visible before a lock-take. What are you talking about with a dummy atomic? You want to order your first operation wrt. some later operation that's already in your code, not some dummy operation. – Peter Cordes Jul 12 '23 at 04:40
  • @PeterCordes Ohh, so I just have to put the seq_cst memory fence after the flag.clear() line? I assumed I needed a dummy atomic because I thought you needed an atomic load after the fence in order to make it work properly. – j0011 Jul 12 '23 at 04:45
  • @PeterCordes I was confused because [this](https://stackoverflow.com/a/13633355/22212465) question says you have to do the fence in reference to some atomic op, and they always do it after the fence, so I assumed I needed at least a store or load after the fence. – j0011 Jul 12 '23 at 04:48

1 Answer


This question doesn't actually have anything to do with memory barriers. It's really a question of forward progress that is addressed by [intro.progress p18]:

An implementation should(*) ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

So once f() has been called and the flag.clear() line has been reached, that value will be observed in finite time, come hell or high water. It doesn't matter what code comes next, be it lock() or recv() or wait_for_godot(). As such, the load in f2() will eventually return false. And thus, the compiler is forbidden from performing any transformation that could prevent that from taking place.

Note that this applies regardless of what std::memory_order is used, and doesn't need any fences or anything else.

One might think this implies that an atomic load or store can't be reordered with a mutex lock, nor with any other operation that can't be proved to always finish in finite time. But the situation is a little more subtle than that: read on.


On the other hand, it is still true that a mutex lock is an acquire barrier only. In general, it is allowed for operations sequenced before the lock to become visible after the lock is acquired, provided that the finite-time rule is still obeyed.

For instance, suppose the machine knows that the store to flag is going to take a long time (maybe it missed in cache). Let's say 10us. It would then be allowed to "schedule" the store to execute when ready, and go on to start trying to lock the mutex. If it should happen that the mutex were successfully acquired in less than 10us, it would be fine for the machine to acquire it and start executing the // do something. In that case, the something could become visible before the store to flag. The key, however, is that if the mutex is not successfully acquired within 10us, then the machine promises to make the store visible anyway.

Now in your particular program, due to its logic, it is not actually possible for the mutex to be acquired before the store to flag is visible to f2. So the machine is free to start trying the mutex before 10us is up, and it can have the intention of immediately going on to // do something if successful, but it so happens that this will not come to pass.

So, we could say that a compiler is only allowed to unconditionally reorder an atomic load or store with another operation if it can prove that the other operation must complete in finite time. But there can still be reordering if it's "conditional", as in the above example: "if the lock is acquired within 10us, then reorder it with the store; otherwise don't".

As a more concrete way to say it, the following is a legitimate way for the compiler to rewrite f():

if (m.try_lock()) {
    // do something
    flag.clear(std::memory_order_relaxed);
} else {
    flag.clear(std::memory_order_relaxed);
    m.lock();
    // do something
}
m.unlock();

(Assuming that do something itself always completes in finite time, and doesn't contain any release barriers.)

In this version, there is a scenario where the mutex has been taken, but the flag has not yet been cleared. (In your program, because of what the other thread is doing, the if branch is not actually reachable; but in some other context it could be.) That's the sense in which they can be reordered, while still satisfying the finite time rule.

But the following, which is what you are worried about, is forbidden:

m.lock();
flag.clear();
// ...

(*) Some people get worried because this says "should" instead of "shall" or "must". So technically, obeying [intro.progress p18] is a matter of quality-of-implementation rather than conformance. However, an implementation that doesn't do it will be so broken that, whether or not you consider it "conformant", it will be unusable.

There are a few instances of this kind of thing in the standard, another example being [atomics.order p8], the "no out-of-thin-air values" rule, which is also worded as "should". I think the issue is that they each specify a requirement in a somewhat informal way, "you obviously know what we mean here", rather than with mathematical formality. If they said "must", it could be a problem for some vendor trying to formally verify that their implementation conforms, because the requirement itself is not stated precisely enough to make that possible.

Nate Eldredge
  • Thank you so much! I think I finally understand this now!! One confusion though, “ As such, the load in f2() will eventually return true”, do you mean that the load will eventually return false, because in the normal case it’s waiting with true, and false signals it to stop. – j0011 Jul 12 '23 at 05:11
  • @j0011: Oh yes, I just got it backwards. Fixed now. – Nate Eldredge Jul 12 '23 at 05:23