C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Question

Could you please help me to understand what std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads and why? The work should be done by whatever thread gets to it first, and all other threads should just check as quickly as possible that someone is already going the work and continue working on other tasks.

In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).

#include <atomic>
#include <iostream>
#include <thread>

std::atomic_flag done = ATOMIC_FLAG_INIT;

const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed; 

void do_some_work_that_needs_to_be_done_only_once(void) 
{ std::cout<<"Hello, my friend\n"; }

void run(void)
{
    if(not done.test_and_set(order))
        do_some_work_that_needs_to_be_done_only_once();
}

int main(void)
{
 std::thread a(run);
 std::thread b(run);
 a.join();
 b.join();
 // expected result:
 //  * only one thread said hello
 //  * all threads spent as little time as possible to check if any 
 //    other thread said hello yet
 return 0;
}

Thank you very much for your help!

I believe that all you need to do is set the flag once the work is done. But you've not put any protection in place to ensure only one thread can be in the function at a time. I admit my multi-threading is a bit weak. — sweenish, Aug 25 '22 at 14:23
If I only set the flag once the work is done, both of the threads could end up doing the work and setting the flag. I need to set a flag before the work begins and let all other threads know that someone is going to do the work, so that they would just move on. — S.V, Aug 25 '22 at 14:26
So if it's unset, immediately set it and do the work? Actually, scratch that. I believe that leads to an anti-pattern. — sweenish, Aug 25 '22 at 14:27
Yes, the work begins only if the flag is not set yet by any other thread, and only one thread should be able to set the flag. The work is done by whatever thread sets the flag first. — S.V, Aug 25 '22 at 14:30
I guess you should use std::memory_order_seq_cst. Some more detail can be found here: https://stackoverflow.com/questions/9553591/c-stdatomic-what-is-stdmemory-order-and-how-to-use-them — Grimkin, Aug 25 '22 at 14:43
`std::memory_order_seq_cst` would certainly work. I wanted to know what is the most relaxed memory order I could use. — S.V, Aug 25 '22 at 14:45

Peter Cordes · Answer 1 · 2022-08-25T16:32:11.617

relaxed is fine if you just need to determine the winner of the race to set the flag¹, so one thread can start on the work and later threads can just continue on.

If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah you have no ordering requirements between threads beyond the modification order of done which exists even with relaxed. An atomic RMW like test_and_set lets you determines which thread's modification was first.

BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.

You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.

Or perhaps use C++11 std::call_once to solve this problem for you.

On normal systems, atomic_flag has no advantage over and atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like plain read that isn't part of an RMW test-and-set.

C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.

Footnote 1: why `relaxed` guarantees a single winner

The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.

Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?

It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.

If two TAS operations could both load the original value (false), and then both store true, they would be atomic. They'd have overlapped with each other.

Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).

Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)

Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.

Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

Are you sure that `relaxed` order would work in my case? My understanding is that in the case of `relaxed` order if thread A sets the flag first, then subsequent reads of the flag by thread A could only return `true`. However, the 1st time thread B reads the flag (before setting it), it can read either `false` or `true` without any guarantee which value would be read. — S.V, Aug 25 '22 at 15:40
My tests show that you are right, but I just do not understand why. Does calling `test_and_set` by a thread force somehow synchronization of the flag with values written by other threads? — S.V, Aug 25 '22 at 15:59
@S.V: The `memory_order` parameter only governs ordering wrt. operations on *other* variables by the same thread. Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value). — Peter Cordes, Aug 25 '22 at 16:19
@S.V: Nothing can guarantee that thread A is the one where `test_and_set` returns false (except some other synchronization), but being an atomic RMW guarantees that exactly one will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object. — Peter Cordes, Aug 25 '22 at 16:21
I am trying to synchronize what you are saying with what I read in the book "C++ Concurrency in Action" by Anthony Williams. According to the book, with `relaxed` order, "modification order of each variable is the only thing shared between threads", but we do not know the position in the modification order each thread is at since the position is not synchronized between threads (per thread CPU caches and internal buffers). So, if thread A wrote `true` to flag, thread B could still be reading value `false` from its CPU cache/buffer since B is still before setting `true` in modification order. — S.V, Aug 25 '22 at 16:58
So, with `memory_order_seq_cst` the position in the modification order of each variable is synchronized between different threads (and so all threads agree on the value of a variable), by with `memory_order_relaxed` no such synchronization is done (and so a thread could see value of a variable which it had after an earlier than the last operation in the modification order). — S.V, Aug 25 '22 at 17:09
@S.V: Did you look at the edit to my answer where I expanded in more detail why atomicity of an RMW is sufficient? The C++ standard guarantees that an atomic RMW sees the "latest value" in the modification order, which is necessary for atomicity; that's what serializes stores and RMWs on an object. (Don't read too much into it for other use-cases though; [a plain load will still see a very recent value on real hardware, an RMW isn't better](https://stackoverflow.com/a/71722126/224132).) — Peter Cordes, Aug 25 '22 at 20:24
_"The C++ standard guarantees that an atomic RMW sees the "latest value" in the modification order"_ : this is the statement which gave me the enlightenment! If possible, could you please let me know if all C++ standard versions guarantee it? Thank you! — S.V, Aug 26 '22 at 21:31
@S.V: C++11 and later, when std::atomic was added, of course. Like I said, atomic increments that could step on each other wouldn't be useful, and would be missing part of what it means for an RMW to be atomic. — Peter Cordes, Aug 26 '22 at 21:45

Nate Eldredge · Accepted Answer · 2022-08-26T05:10:02.610

Following up on some things in the comments:

As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.

Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.

We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.

Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.

All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.

In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.

As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.

You mentioned _"by the nature of atomic read-modify-writes, the value returned by each thread's `test_and_set` is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10"_. Could you please let me know if the same is true in the C++17 standard? Thank you! — S.V, Aug 26 '22 at 21:24
Yes, exactly the same. It's atomics.order p11 in the [N4659 final draft of C++17](https://open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4659.pdf). "Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation." It's been the same all the way back to C++11 when atomics were first introduced. — Nate Eldredge, Aug 27 '22 at 04:06

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

2 Answers2

Footnote 1: why `relaxed` guarantees a single winner

Linked

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

2 Answers2

Footnote 1: why relaxed guarantees a single winner

Linked

Footnote 1: why `relaxed` guarantees a single winner