The C++ standard is rather vague about making atomic stores visible to other threads..
29.3.12
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
That is as detailed as it gets, there is no definition of 'reasonable', and it does not have to be immediately.
Using a stand-alone fence to force a certain memory ordering is not necessary since you can specify those on atomic operations, but the question is,
what is your expectation with regards to using a memory fence..
Fences are designed to enforce ordering on memory operations (between threads), but they do not guarantee visibility in a timely manner.
You can store a value to an atomic variable with the strongest memory ordering (ie. seq_cst
), but even when another thread executes load()
at a later time than the store()
,
you might still get an old value from the cache and yet (surprisingly) it does not violate the happens-before relationship.
Using a stronger fence might make a difference wrt. timing and visibility, but there are no guarantees.
If prompt visibility is important, I would consider using a Read-Modify-Write (RMW) operation to load the value.
These are atomic operations that read and modify atomically (ie. in a single call), and have the additional property that they are guaranteed to operate on the latest value.
But since they have to reach a little further than the local cache, these calls also tend to be more expensive to execute.
As pointed out by Maxim Egorushkin, whether or not you can use weaker memory orderings than the default (seq_cst
) depends on whether other memory operations need to be synchronized (made visible) between threads.
That is not clear from your question, but it is generally considered safe to use the default (sequential consistency).
If you are on an unusually weak platform, if performance is problematic, and if you need data synchronization between threads, you could consider using acquire/release semantics:
// thread 1
device_reg_copy.store(*Device_reg_ptr, std::memory_order_release);
// thread 2
device_reg_copy.fetch_add(0, std::memory_order_acquire);
If thread 2 sees the value written by thread 1, it is guaranteed that memory operations prior to the store in thread 1 are visible after the load in thread 2.
Acquire/Release operations form a pair and they synchronize based on a run-time relationship between the store and load. In other words, if thread 2 does not see the value stored by thread 1,
there are no ordering guarantees.
If the atomic variable has no dependencies on any other data, you can use std::memory_order_relaxed
; store ordering is always guaranteed for a single atomic variable.
As mentioned by others, there is no need for volatile
when it comes to inter-thread communication with std::atomic
.