6

I need a very fast (in the sense "low cost for reader", not "low latency") change notification mechanism between threads in order to update a read cache:

The situation

Thread W (Writer) updates a data structure (S) (in my case a setting in a map) only once in a while.

Thread R (Reader) maintains a cache of S and reads it very frequently. When Thread W updates S, Thread R needs to be notified of the update in reasonable time (10-100ms).

Architecture is ARM, x86 and x86_64. I need to support C++03 with gcc 4.6 and higher.

Code

The code looks something like this:

// variables shared between threads
bool updateAvailable;
SomeMutex dataMutex;
std::string myData;

// variables used only in Thread R
std::string myDataCache;

// Thread W
dataMutex.Lock();
myData = "newData";
updateAvailable = true;
dataMutex.Unlock();

// Thread R

if(updateAvailable)
{
    dataMutex.Lock();
    myDataCache = myData;
    updateAvailable = false;
    dataMutex.Unlock();
}

doSomethingWith(myDataCache);

My Question

In Thread R no locking or barriers occur in the "fast path" (no update available). Is this an error? What are the consequences of this design?

Do I need to qualify updateAvailable as volatile?

Will R get the update eventually?

My understanding so far

Is it safe regarding data consistency?

This looks a bit like "Double Checked Locking". According to http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html a memory barrier can be used to fix it in C++.

However the major difference here is that the shared resource is never touched/read in the Reader fast path. When updating the cache, the consistency is guaranteed by the mutex.

Will R get the update?

Here is where it gets tricky. As I understand it, the CPU running Thread R could cache updateAvailable indefinitely, effectively hoisting the read far before the actual if statement.

So the update could take until the next cache flush, for example when another thread or process is scheduled.
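This caching concern can be sketched in isolation. The two functions below are a hand-written, single-threaded simulation (not real compiler output; all names are illustrative) of what hoisting the load of a plain, non-atomic flag out of a loop does. Note that in the real two-thread case the reader never writes the flag, which is exactly what makes the hoisted form a legal transformation there:

```cpp
bool updateAvailable = false;  // plain, non-atomic flag, as in the question

// As written: the flag is re-read on every iteration, so a store made
// during the loop is observed on later iterations. (Here the store is
// simulated in the same thread as a stand-in for Thread W.)
int readerAsWritten(int iterations) {
    int updatesSeen = 0;
    for (int i = 0; i < iterations; ++i) {
        if (updateAvailable) ++updatesSeen;
        if (i == 0) updateAvailable = true;  // stand-in for Thread W's store
    }
    return updatesSeen;
}

// Hand-written simulation of the transformation a compiler may apply
// when it sees no write to the flag inside the loop: a single load,
// kept in a register, so later external stores are never observed.
int readerAsCompiled(int iterations) {
    bool cached = updateAvailable;  // hoisted single load
    int updatesSeen = 0;
    for (int i = 0; i < iterations; ++i) {
        if (cached) ++updatesSeen;
    }
    return updatesSeen;
}
```

In the hoisted form, a flag that starts out false is never seen as true, no matter how many iterations run.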

Hannah S.
  • P.S.: I know about http://stackoverflow.com/questions/8517969/is-this-the-correct-way-to-atomically-read-and-write-a-bool – Hannah S. Nov 27 '15 at 11:21
  • The code has undefined behaviour, period. Adding `volatile` doesn't change that. – Jonathan Wakely Nov 30 '15 at 17:02
  • Can you elaborate? The only way I see where it is undefined is if the CPU does not reliably propagate the store to the other cores. `volatile` should ensure the compiler actually issues a store to a memory address (whatever the CPU decides to do with that is another thing... - it could be held in a write queue, in cache or whatever before reaching main memory) – Hannah S. Nov 30 '15 at 17:10
  • 1
    A non-atomic read on a variable that is concurrently modified by another thread is undefined behaviour. Adding `volatile` doesn't change that **at all**. The standard is clear about that, and breaking that rule is UB, so you cannot predict what a compiler will do to your code. The standard says nothing about CPUs or other cores, it says that the only way to ensure correct behaviour in this case is with atomic operations. http://isvolatileusefulwiththreads.com/ – Jonathan Wakely Nov 30 '15 at 17:12
  • @JonathanWakely this is why I stated the architecture and used compiler. Things I can easily check in my code for and provide a generic alternative that is less performant for environments where the implementation is unknown. Something that is undefined in the standard can very well be defined for a certain architecture and compiler. – Hannah S. Nov 30 '15 at 17:16
  • As I commented on your other question, the compiler implements the language standard. If you want to reason about your code at the hardware level then write assembly. If you ask a C++ compiler to compile your C++ code then it had better follow the rules of the C++ standard, otherwise all bets are off. If you want to access variables from multiple threads then use locks to prevent concurrent reads and writes, or use atomic operations. Modern versions of GCC optimize assuming you have no undefined behaviour, and volatile doesn't stop it being undefined, on any architecture. – Jonathan Wakely Nov 30 '15 at 17:35
  • Ok, I see your point that the standard does not guarantee that `bool` is updated atomically. (However I doubt it would be implemented in any other way...) What if I would change the type of `updateAvailable` to `sig_atomic_t`? – Hannah S. Nov 30 '15 at 17:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/96583/discussion-between-hanno-s-and-jonathan-wakely). – Hannah S. Nov 30 '15 at 18:01
  • I'm not talking about "atomic" as in avoiding word-tearing, I'm talking about what the C++ standard means by atomic. A `sig_atomic_t` is not an atomic variable, so still no atomic operations, so still undefined behaviour. Use https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Atomic-Builtins.html or (since GCC 4.7) https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/_005f_005fatomic-Builtins.html for atomic operations in C++03. – Jonathan Wakely Nov 30 '15 at 18:06
  • @HannoS. You don't think `j = !j;` (where `j` is a bool) could be implemented as a read of `j` followed by a write of `j`? What's your basis for that? – David Schwartz Nov 30 '15 at 20:03
  • C++03 is only well-defined for a single thread. Use C++11. – M.M Nov 30 '15 at 21:23
  • @David: where does the code do that? Clearly this is a non-atomic operation. By update in context of this question I am of course referring to storing a constant value or loading into a register. – Hannah S. Nov 30 '15 at 22:23
  • @HannoS. That's just an "I can't think of a way this can go wrong" argument. Those are never valid. You can't know every possible optimization a compiler, today or in the future, might possibly make. A compiler might optimize `j=0;` to `j=a; j-=a;`, and I've seen real world optimizations that were just as surprising and broke code. – David Schwartz Nov 30 '15 at 23:56
  • @DavidSchwartz: Thanks, that was the kind of answer I was looking for. The discussion convinced me that if I rely on the compiler generating certain instructions it is better to write those instructions in assembler directly. – Hannah S. Dec 02 '15 at 10:28
  • @JonathanWakely: atomic means one-thread-at-a-time. Visibility is not defined only for atomics. Compilers may guarantee the standard, but the architecture may have stronger guarantees. And as you must know, people were using volatiles and atomics long before C++ had any notion of atomics. I'm not sure about ARM, but on x86, volatile with gcc 4.x would be just enough. I've been using this for long enough, for more than just one variable... you can make a simple experiment to prove it to yourself. I'd also suggest you look into seqlocks and spinlocks. – BitWhistler Jan 23 '16 at 00:07
  • @BitWhistler, you cannot do a simple experiment to prove that something with undefined behaviour is safe. "It appears to work" is a possible consequence of undefined behaviour, but proves nothing, especially not that it will continue to work with tomorrow's compilers. – Jonathan Wakely Jan 23 '16 at 17:08
  • Look Jonathan, I'm not going to argue with you forever, find proof in intel docs or anything else. I'm getting the ultimate lowest latency and am very happy with it. If you want to pay a bit more to feel better, safer, more compliant, etc, then by all means do it. – BitWhistler Jan 23 '16 at 17:19

3 Answers

2

Use C++ atomics and make updateAvailable a std::atomic<bool>. The reason is that it's not just the CPU that can see a stale version of the variable, but especially the compiler: it doesn't see the side effects of another thread and thus never bothers to refetch the variable, so you never see the updated value in the reader thread. Additionally, this way you get a guaranteed atomic read, which you don't have if you just read the value.

Other than that, you could potentially get rid of the lock, if for example the producer only ever produces data when updateAvailable is false, you can get rid of the mutex because the std::atomic<> enforces proper ordering of the reads and writes. If that's not the case, you'll still need the lock.
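A minimal sketch of this suggestion, for toolchains where C++11 is available (the function names are mine, not the asker's; the asker's C++03 constraint is discussed in the comments below):

```cpp
#include <atomic>
#include <mutex>
#include <string>

std::atomic<bool> updateAvailable(false);
std::mutex dataMutex;
std::string myData;        // shared, protected by dataMutex
std::string myDataCache;   // used only by Thread R

// Thread W: publish new data, then raise the flag.
void writer(const std::string& newData) {
    std::lock_guard<std::mutex> lock(dataMutex);
    myData = newData;
    updateAvailable.store(true);  // seq_cst by default
}

// Thread R: fast path is a single atomic load, no lock taken.
void reader() {
    if (updateAvailable.load()) {
        std::lock_guard<std::mutex> lock(dataMutex);
        myDataCache = myData;
        updateAvailable.store(false);
    }
    // doSomethingWith(myDataCache);
}
```

Calling writer() and then reader() (even from a single thread) leaves the new value in myDataCache and the flag cleared.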

JustSid
  • std::atomic is not available in C++03. Getting rid of the lock is not needed, since updates are very seldom – Hannah S. Nov 27 '15 at 13:06
  • 1
    @HannoS. then you need to use the built-ins of your compiler/platform to provide the atomic operations. Volatile is only good for signal handler use. – JustSid Nov 30 '15 at 21:51
2

You do have to use a memory fence here. Without the fence, there is no guarantee updates will ever be seen by the other thread. In C++03 you have the option of either using platform-specific ASM code (mfence on Intel; no idea about ARM) or using OS-provided atomic set/get functions.
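As one concrete possibility (an assumption about the toolchain, not the only option): GCC's legacy __sync builtins are available in gcc 4.6 and usable from C++03, so the flag handling could look like this:

```cpp
int updateAvailable = 0;  // int, because the __sync builtins want an integral type

// Thread W, called after updating myData under the mutex:
void signalUpdate() {
    __sync_synchronize();                           // full barrier: data before flag
    __sync_lock_test_and_set(&updateAvailable, 1);  // atomic store of 1
}

// Thread R fast path: an atomic read, expressed as a no-op fetch-and-add.
bool updatePending() {
    return __sync_fetch_and_add(&updateAvailable, 0) != 0;
}

// Thread R, after copying myData into the cache under the mutex:
void clearUpdate() {
    __sync_lock_test_and_set(&updateAvailable, 0);
}
```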

SergeyA
  • The C++ standard does **guarantee** that writing to any location will be seen by other threads in a **reasonable amount of time**. If you want the precise wording, I will find it in the standard. – Tsyvarev Nov 27 '15 at 20:42
  • 1
    @Tsyvarev Nope. You are very confused. Those guarantees are only applicable when atomics are used. – SergeyA Nov 27 '15 at 20:43
  • 29.3.13: `Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.`. Yes, it is not about accesses to *any* variable, but *atomic store* includes `.store` with `memory_order_relaxed` order, that is without fences. As for fences in given example, they are porivided by .lock() and .unlock() operations. – Tsyvarev Nov 27 '15 at 22:10
  • A colleague indicated to me that the *ever* part is guaranteed by the CPUs cache coherence protocol. Both x86 and armv7mp have cache coherence protocols in place. Can you comment on that? – Hannah S. Nov 28 '15 at 10:06
  • 1
    @HannoS. That's only relevant if you're trying to explain how things happen to work on particular platforms. It's not relevant to the guarantees you have. – David Schwartz Nov 30 '15 at 23:57
  • I *am* trying to find out what happens on particular platforms. That's why I stated "Architecture is ARM, x86 and x86_64. I need to support C++03 with gcc 4.6 and higher." I am not asking whether there is a general way to do this with C++03 on any platform. (there isn't as C++03 does not know about threads) – Hannah S. Dec 02 '15 at 17:24
1

Do I need to qualify updateAvailable as volatile?

As volatile has no connection to the threading model in C++, you should use atomics to make your program strictly standard-conformant:

On C++11 or newer, the preferable way is to use atomic<bool> with memory_order_relaxed stores/loads:

atomic<bool> updateAvailable;

//Writer
....
updateAvailable.store(true, std::memory_order_relaxed); //set (under mutex locked)

// Reader

if(updateAvailable.load(std::memory_order_relaxed)) // check
{
    ...
    updateAvailable.store(false, std::memory_order_relaxed); // clear (under mutex locked)
    ....
}

gcc since 4.7 supports similar functionality in its __atomic builtins.
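For illustration, the relaxed C++11 version above could be mirrored with those builtins roughly like this (a sketch assuming gcc 4.7's __atomic interface; the function names are mine):

```cpp
int updateAvailable = 0;

// Writer: set the flag (while holding the mutex), relaxed ordering.
void setFlag()   { __atomic_store_n(&updateAvailable, 1, __ATOMIC_RELAXED); }

// Reader fast path: one relaxed atomic load, no lock, no fence.
bool testFlag()  { return __atomic_load_n(&updateAvailable, __ATOMIC_RELAXED) != 0; }

// Reader: clear the flag after refreshing the cache (under the mutex).
void clearFlag() { __atomic_store_n(&updateAvailable, 0, __ATOMIC_RELAXED); }
```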

As for gcc 4.6, it seems there is no strictly-conformant way to avoid fences when accessing the updateAvailable variable. In practice, a memory fence is usually much faster than the 10-100ms you allow. So you can use its __sync builtins:

int updateAvailable = 0;

//Writer
...
__sync_fetch_and_or(&updateAvailable, 1); // set to non-zero
....

//Reader
if(__sync_fetch_and_and(&updateAvailable, 1)) // check, but never change
{
    ...
    __sync_fetch_and_and(&updateAvailable, 0); // clear
    ...
}

Is it safe regarding data consistency?

Yes, it is safe. Your reasoning is absolutely correct here:

the shared resource is never touched/read in the Reader fast path.


This is NOT double-checked locking!

This is explicitly stated in the question itself.

When updateAvailable is false, the Reader thread uses the variable myDataCache, which is local to the thread (no other thread uses it). With a double-checked locking scheme, all threads use the shared object directly.

Why memory fences/barriers are NOT NEEDED here

The only variable accessed concurrently is updateAvailable. The myData variable is accessed under mutex protection, which provides all needed fences. myDataCache is local to the Reader thread.

When the Reader thread sees updateAvailable as false, it uses the myDataCache variable, which is changed only by the thread itself. Program order guarantees correct visibility of changes in that case.

As for visibility guarantees for the variable updateAvailable, the C++11 standard provides such guarantees for atomic variables even without fences. 29.3 p13 says:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

Jonathan Wakely has confirmed in chat that this paragraph applies even to memory_order_relaxed accesses.

Tsyvarev
  • @SergeyA: Atomics are needed only to avoid **undefined behavior** according to the C++11 **standard**. But all commonly-used compilers (including `gcc`) generate correct code even without them. Also, the asker explicitly says this is about C++03, which lacks atomics. – Tsyvarev Nov 27 '15 at 20:34
  • 3
    First of all, volatile will not work this way on ARM. Period. It works on Intel due to strict ordering guarantees, which are not provided on ARM. Second, atomics do still exist outside of C++11 – see my answer. – SergeyA Nov 27 '15 at 20:38
  • I think `volatile` instructs the compiler not to cache the variable in a register, also on ARM. Due to cache coherency protocols (which ARM also has) this should be sufficient if the read/write queues of the CPUs are cleared in a reasonable amount of time. – Hannah S. Nov 30 '15 at 11:36
  • @HannoS.: Yes, `volatile` is needed only for preventing compiler's caching, and this is stated in my answer. I am unsure what @SergeyA means by `volatile will not work in this way on ARM`. – Tsyvarev Nov 30 '15 at 11:46
  • Ok, as `volatile` is not strictly-safe within C++ threading model, I have reworked my answer. – Tsyvarev Nov 30 '15 at 19:15
  • @Tsyvarev As typically implemented, `volatile` happens to provide store ordering on x86 but not arm. – David Schwartz Nov 30 '15 at 19:25
  • @DavidSchwartz: Variable `updateAvailable` **doesn't require ordering guarantees** here. In the case of a *true* value, such a guarantee is provided by the mutex. In the case of a *false* value, it is provided simply by program order: the `myDataCache` variable, accessed in that case, is completely **local to the reader thread**, so no concurrency issue is possible with this variable. – Tsyvarev Nov 30 '15 at 19:59
  • @Tsyvarev He does require ordering guarantees, otherwise there is no assurance another thread will *ever* see the change no matter what it does after making it. Think about the implications of 29.3.13. – David Schwartz Nov 30 '15 at 20:01
  • I found this clause in the C++11 standard: `29.3.13: Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.`. I think it is applicable to `memory_order_relaxed` stores and loads. Am I wrong about that? (Note that my answer currently advises using *atomic*, not *volatile*). – Tsyvarev Nov 30 '15 at 20:08
  • You can make the store relaxed since there is nothing really done with the information on that thread afterwards and the load can be an acquire operation. – JustSid Nov 30 '15 at 21:46
  • 1
    @JustSid: Why can't the load be a relaxed operation too? According to the asker's description, it is exactly the load operation which should be as fast as possible. – Tsyvarev Nov 30 '15 at 22:29
  • 1
    @DavidSchwartz: There is NO double-checked locking here! This is explicitly stated in the question post itself. I have added this note to my answer, please check it. – Tsyvarev Dec 01 '15 at 08:02