relaxed ordering and inter thread visibility

Question

I learnt from relaxed ordering as a signal that a store on an atomic variable should be visible to other thread in a "within a reasonnable amount of time".

That say, I am pretty sure it should happen in a very short time (some nano second ?). However, I don't want to rely on "within a reasonnable amount of time".

So, here is some code :

std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_relaxed))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_relaxed);
}

Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem. I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).

We know for sure that the functionThatWillBeLaunchedInThreadAfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.

However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).

Here are some solution I think about.

Solution 1 : Use volatility

Just declare volatile std::atomic_bool canBegin{false}; Here the volatileness guarantee us that we will not see stale value.

Solution 2 : Use mutex or spinlock

Here the idea is to protect the canBegin access via a mutex / spinlock that guarantee via a release/acquire ordering that we will not see a stale value. I don't need canGo to be an atomic either.

Solution 3 : not sure at all, but memory fence?

Maybe this code will not work, so, tell me :).

bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();

// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);

On cpp reference, for this case, it is write that :

all non-atomic and relaxed atomic stores that are sequenced-before FB in thread B will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread A after FA

Which solution would you use and why?

Don't use `volatile` _"This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution"_ source: https://en.cppreference.com/w/cpp/language/cv — Richard Critten, Jul 06 '19 at 08:50
As far as I know volatile is useless when used with atomic, please see https://stackoverflow.com/questions/8819095/concurrency-atomic-and-volatile-in-c11-memory-model — CuriouslyRecurringThoughts, Jul 06 '19 at 09:10
Anthony Williams wrote : The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. — Antoine Morrier, Jul 06 '19 at 09:29
So I think volatile maybe needed in this case. No ? Or I need to use a RMW — Antoine Morrier, Jul 06 '19 at 09:31
What's confusing is that you say "_there is no creation of thread or whatsoever_" and then after that: "_We know for sure that the functionThatWillBeLaunchedInThreadBfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadA_". So how is the order between those thread functions enforced ? — LWimsey, Jul 06 '19 at 17:00
@LWimsey because I actually do something like that : In the main thread, I do :`functionThatWillBeLaunchedInThreadB(); threadPool.add(functionThatWillBeLaunchedInThreadA);` That guarantee that the second function will be executed AFTER the end of the first one :) — Antoine Morrier, Jul 07 '19 at 08:02
@LWimsey btw, I inverted thread A and thread B :), thanks for the correction — Antoine Morrier, Jul 07 '19 at 08:07
@AntoineMorrier What is a latest value? Where is that defined? — curiousguy, Jul 08 '19 at 05:48
@curiousguy : The latest value is the last value wrote. Not a old value from the cache or whatsoever — Antoine Morrier, Jul 09 '19 at 16:34
@AntoineMorrier You mean a CPU cache? They don't work that way in any currently used arch where C/C++/Java is supported. Caches hide old data as soon as another CPU gets permission to change it, and it isn't racy, it's synchronous. Of course other threads that have already copied values locally, in regs or on the stack, are another issue entirely. The cache issue (which doesn't exist) is being described as a real problem on all sorts of blogs and even IBM website. So much disinfo! (That however doesn't mean that there can't be stores not yet visible in memory.) — curiousguy, Jul 09 '19 at 23:59
@AntoineMorrier You missed my point. Last compared to what? In which timeline? What about the other memory operations of the same thread, do they get early or late effects? "Late" is meaningless in MT semantics. It's such a mess! — curiousguy, Jul 10 '19 at 00:00
@curiousguy Single modification order means that all atomic operations on the variable occur in some order, but that order does not necessarily follow clock time.A load can occur later (in clock time) than a store, but still be ordered before the store — LWimsey, Jul 10 '19 at 01:54
@LWimsey According to which clock? The clock of every single thread? — curiousguy, Jul 10 '19 at 04:19
@curiousguy Latest in the modification order isn't really defined for single loads and stores, but it sure is for RMW operations. — LWimsey, Jul 10 '19 at 04:42
@LWimsey I understand that latest just means the read is just before the write.* — curiousguy, Jul 10 '19 at 04:49
@curiousguy the read is logically before the write, but not always in clock time — LWimsey, Jul 10 '19 at 04:58
@RichardCritten You should use volatile when you want each C/C++ operation to be translated to a single asm action, which seems to be the case here; however making some operations volatile fails to express any relation with non volatile operations. You have to tie the volatile operations to the others. — curiousguy, Jul 10 '19 at 19:26
There is nothing you can write to force a write to become visible to another thread any sooner. Fences and synchronization instructions ensure that if you see *this* value then *that* value is also visible, but don't make *this* value propagate any quicker. The best you can hope for is to repeatedly poll, or use a facility such as C++20 atomic waits to suspend your thread until the change has happened. — Anthony Williams, Jun 09 '21 at 20:25

Peter Cordes · Answer 1 · 2023-08-16T17:57:17.020

There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.

The store part of an RMW is not different from a pure store for this, either.

(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)

Anthony's answer that was linked in comment is misleading; as I commented there:

If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.

The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.

If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;

Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.

On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".

volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can with the atomic.

Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.

But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.

In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).

(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet", it can be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)

score 1 · Answer 2 · edited May 27 '21 at 20:59

1

We know for sure that the functionThatWillBeLaunchedInThreadAfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.

First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.

On to the answer...

By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.

std::atomic_bool canBegin{false};

void functionThatWillBeLaunchedInThreadA() {
    if(canBegin.load(std::memory_order_acquire))
        produceData();
}

void functionThatWillBeLaunchedInThreadB() {
    canBegin.store(true, std::memory_order_release);
}

By the way, shouldn't this be a while loop?

void functionThatWillBeLaunchedInThreadA() {
    while (!canBegin.load(std::memory_order_acquire))
    { }
    produceData();
}

I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?)

In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the canBegin store true before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.

However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).

You said yourself:

a store on an atomic variable should be visible to other thread in a "within a reasonnable amount of time".

Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).

Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.

You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.

A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.

std::atomic<bool> canGo{false}; // MUST be atomic

// in thread A
if(canGo.load(std::memory_order_relaxed))
{
    std::atomic_thread_fence(std::memory_order_acquire);
    produceData();
}

// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, memory_order_relaxed);

The memory fences are actually more strict than acquire/release ordering on the std::atomicload/store so this gains nothing and could be more expensive.

It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but they're also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an load acquire compiles to just a regular load and and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

edited May 27 '21 at 20:59

Peter Cordes

328,167
45
605
847

answered Jul 09 '19 at 10:12

Humphrey Winnebago

1,512
8
15

First of all, thanks for your superb answer. Let's say it is a hypothetical example :). `By far the simplest thing to do is acquire/release ordering.` Acquire / release ordering is useless here because we don't need to prevent anything to be read / write after / before the load / store. Moreover, even sequentially consistency ordering does not guarantee to see the latest value. But that's true if we need to enforce the ordering (that is very likely in our case) We can't use a loop because the user may never want to produce the value, so the loop will loop forever... Following at bottom. – Antoine Morrier Jul 09 '19 at 17:56
So okay, stale value is maybe not the good expression and not propagated yet is better. I agree with you on not to use volatile. It was not a good idea. The only way to get the latest value is a RMW operation, or protect everything by a mutex or a spinlock as you said :). Concerning memory fence, here it is what is write on cpp-reference : `In this case, all non-atomic and relaxed atomic stores that are sequenced-before FA in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after FB`. Following... – Antoine Morrier Jul 09 '19 at 18:04
To me it is clear that we can replace relaxed and non atomic load / store. Or maybe I misunderstood something? I am targetting a kind of arm architecture. Not a x86. But since the example is purely hypothetical, the idea is just to write a code that will works everywhere, even on a hypothetical architecture where "reasonnable amount of time" means several years. Thanks again for your answer :) – Antoine Morrier Jul 09 '19 at 18:12
1

The signaling variable must be `std::atomic` or it is undefined behavior. Without this, there is no guarantee of atomic write. This also necessitates the proper byte alignment so a variable doesn't straddle a cache line. A different core could see a partial result before the write is finished. Just the possibility of this happening means your whole program is classified as UB... ; An atomic RMW will get the latest value, but it will not be worth it because it will be more expensive than just waiting for the atomic store to propagate to all cores. – Humphrey Winnebago Jul 10 '19 at 07:30
It can easily slow down other operations the cores want to do. A couples years is definitely NOT a "reasonable amount of time" :) You could probably force the issue with a release fence following the write, but this, again, would probably be more expensive in the end. TBH if you came up with a way to test it and show the results, it would be extremely useful information. ; `Acquire / release ordering is useless here because we don't need to prevent anything to be read / write after / before the load / store` What if the functions are inlined? – Humphrey Winnebago Jul 10 '19 at 07:33
@HumphreyWinnebago Do release fence usually push writes faster? – curiousguy Jul 14 '19 at 11:38
3

@curiousguy Not at all. *"Memory barriers are not a way of making prior memory accesses get done faster."* (https://lwn.net/Articles/573436/). It's just that having the release barrier doesn't give the CPU much else to do. Don't quote me on that, though – Humphrey Winnebago Sep 22 '19 at 18:13
*because of the data dependency.* - but you don't have a data dependency, you have a *control* dependency on the load value, i.e. a branch. A *data* dependency is like `int idx = shared.load()` / `tmp = arr[idx];` where you use the load result as a pointer or index or some other part of the address calculation for another load. Most commonly just loading a pointer and dereffing it directly, e.g. to follow a linked list and read elements of that list. – Peter Cordes May 27 '21 at 20:45
*An atomic operation is much more expensive than an acquire/release* - What? Atomic ops are cheaper than mutex acquire / mutex release. The rest of that paragraph seems to be (correctly) arguing that a mutex or spinlock acquire/release requires at least one atomic RMW operation, and thus is *more* expensive than directly doing an atomic operation, especially a pure-load or pure-store that isn't seq_cst. (Assuming `atomic::is_lock_free()` of course, like it will be for small primitive types like int or bool.) Hopefully that was a typo for "A mutex operation is more expensive ..." – Peter Cordes May 27 '21 at 20:48
@PeterCordes Works for me. I need to study reordering around the different dependencies a little more. And yes, I meant to say something like a RMW instruction is more expensive than and atomic acquire-read or store-release. – Humphrey Winnebago May 28 '21 at 04:34
See Paul McKenny's CppCon 2015 talk [C++ Atomics: The Sad Story of memory_order_consume](https://isocpp.org/blog/2016/08/cppcon-2015-cpp-atomics-the-sad-story-of-memory-order-consume-paul-e.-mcken) for more about dependency-ordering, and how Linux RCU currently uses it (with just `volatile` in the kernel, or `mo_relaxed` in user-space), and code carefully so the compiler can't optimize away the data dependency.) – Peter Cordes May 28 '21 at 05:02
I also wrote some answers, [C++11: the difference between memory\_order\_relaxed and memory\_order\_consume](https://stackoverflow.com/a/59832012) / [Memory order consume usage in C11](https://stackoverflow.com/q/55741148). (And BTW, I posted a more specific answer to *this* question, about the querent's wrong idea that they could speed up inter-thread visibility somehow, that plain store wasn't already visible as fast as possible.) – Peter Cordes May 28 '21 at 05:03

relaxed ordering and inter thread visibility

Solution 1 : Use volatility

Solution 2 : Use mutex or spinlock

Solution 3 : not sure at all, but memory fence?

2 Answers2

Linked