
The code below is used to assign work to multiple threads, wake them up, and wait until they are done. The "work" in this case consists of "cleaning a volume". What exactly this operation does is irrelevant for this question -- it just helps with the context. The code is part of a huge transaction processing system.

void bf_tree_cleaner::force_all()
{
    for (int i = 0; i < vol_m::MAX_VOLS; i++) {
        _requested_volumes[i] = true;
    }
    // fence here (seq_cst)

    wakeup_cleaners();

    while (true) {
        usleep(10000); // 10 ms

        bool remains = false;
        for (int vol = 0; vol < vol_m::MAX_VOLS; ++vol) {
            // fence here (seq_cst)
            if (_requested_volumes[vol]) {
                remains = true;
                break;
            }
        }
        if (!remains) {
            break;
        }
    }
}

A value in a boolean array _requested_volumes[i] tells whether thread i has work to do. When it is done, the worker thread sets it to false and goes back to sleep.

The problem I am having is that the compiler generates an infinite loop, where the variable remains is always true, even though all values in the array have been set to false. This only happens with -O3.

I have tried two solutions to fix that:

  1. Declare _requested_volumes volatile (EDIT: this solution does work actually. See edit below)

Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses. But there's a lot of dispute over this on the Internet. The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access. In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.

  2. Introduce memory fences

The method wakeup_cleaners() acquires a pthread_mutex_t internally in order to set a wake-up flag in the worker threads, so it should implicitly produce proper memory fences. But I'm not sure if those fences affect memory accesses in the caller method (force_all()). Therefore, I manually introduced fences in the locations specified by the comments above. This should make sure that writes performed by the worker thread in _requested_volumes are visible in the main thread.

What puzzles me is that neither of these solutions works, and I have absolutely no idea why. The semantics and proper use of memory fences and volatile are confusing me right now. The problem is that the compiler is applying an undesired optimization -- hence the volatile attempt. But it could also be a problem of thread synchronization -- hence the memory fence attempt.

I could try a third solution in which a mutex protects every access to _requested_volumes, but even if that works, I would like to understand why, because as far as I understand, it's all about memory fences. Thus, it should make no difference whether it's done explicitly or implicitly via a mutex.


EDIT: My assumptions were wrong and Solution 1 actually does work. However, my question remains in order to clarify the use of volatile vs. memory fences. If volatile is such a bad thing that it should never be used in multithreaded programming, what else should I use here? Do memory fences also affect compiler optimizations? Because I see these as two orthogonal issues, and therefore orthogonal solutions: fences for visibility across threads and volatile for preventing optimizations.

Caetano Sauer
    Sounds like a data race and [data races are undefined behavior](http://stackoverflow.com/a/24682612/1708801) and one possible undefined behavior would be to turn it into an infinite loop. – Shafik Yaghmour Aug 11 '15 at 14:43
    Can you create an [mcve]? – NathanOliver Aug 11 '15 at 14:46
    volatile is not (has never been) used for threads. try [] of std::atomic<> – Richard Critten Aug 11 '15 at 14:48
    `volatile` should have worked for your example, at least there wouldn't be an infinite loop. Could you double check that every `_requested_volumes[i]` is `false`? – Serge Rogatch Aug 11 '15 at 14:54
  • You might want to look at [this page](http://en.cppreference.com/w/cpp/atomic) to see what C++11 can offer regarding atomic operations. – Caninonos Aug 11 '15 at 14:54
    Re your comment about volatile: the read of the memory will not be optimized away; but the read(s) do not cause the CPU caches to synchronize (extra memory fence instructions are NOT inserted). So "old" values can/will be read from the cache and updates may not been seen. – Richard Critten Aug 11 '15 at 14:56
  • @SergeRogatch Yes, I verified that with ```p *_requested_volumes@32``` in gdb (MAX_VOLS is 32) . It says ```{ false }``` – Caetano Sauer Aug 11 '15 at 14:56
    As for `volatile` see [this](http://en.cppreference.com/w/cpp/language/cv): "that is, within a single thread of execution, volatile accesses cannot be reordered or optimized out. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order" – Caninonos Aug 11 '15 at 14:56
  • @RichardCritten Ok, but why don't the memory fences help either? – Caetano Sauer Aug 11 '15 at 14:58
  • @CaetanoSauer, I would assume that GDB gives wrong status, rather than `volatile` doesn't work. Can you verify that with debug-printing the value of every `_requested_volumes[i]`? – Serge Rogatch Aug 11 '15 at 14:58
  • @SergeRogatch You were right. Volatile does work and I didn't execute the test correctly. Please see my latest edit -- I would still like to understand the issue in more detail. Thanks! – Caetano Sauer Aug 11 '15 at 15:15
    Relying on `volatile` for inter-thread synchronization is UB according to C++11/14 – Anton Aug 11 '15 at 15:23
  • @Anton Please read the question carefully. I am not relying on it for synchronization; I'm relying on it to prevent compiler optimizations. I did use fences for synchronization and that is exactly the point of the question. – Caetano Sauer Aug 11 '15 at 15:27
  • Ok, I will. But currently it seems to miss important pieces of code like what type is `_requested_volumes`? How it is treated in the other thread.. – Anton Aug 11 '15 at 15:31

2 Answers


Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses.

Yes.

But there's a lot of dispute over this on the Internet.

Not, generally, between "the experts".

The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access.

Nope.

Non-pure, non-constexpr, non-inlined function calls (getters/accessors) also necessarily have this effect. Admittedly, link-time optimization confuses the issue of which functions may really get inlined.

In C, and by extension C++, volatile affects memory access optimization. Java took the same keyword and, since it can't (or couldn't) be used for the low-level tasks C uses volatile for in the first place, redefined it to provide a memory fence.

The correct way to get the same effect in C++ is using std::atomic.

In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.

No, it may have the desired effect, depending on how it interacts with your platform's cache hardware. This is brittle - it could change any time you upgrade a CPU, or add another one, or change your scheduler behaviour - and it certainly isn't portable.


If you're really just tracking how many workers are still working, sane methods might be a semaphore (synchronized counter), or a mutex+condvar+integer count. Either is likely more efficient than busy-looping with a sleep.
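A minimal sketch of the mutex+condvar+count option (all names here are illustrative, not from the question's code base):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical synchronization helper: the main thread sets the number of
// pending cleaners, each cleaner decrements it when done, and the waiter
// blocks on a condition variable instead of polling with a sleep.
struct CleanerSync {
    std::mutex m;
    std::condition_variable cv;
    int pending = 0;

    void start(int n) {
        std::lock_guard<std::mutex> lk(m);
        pending = n;
    }
    void one_done() {
        std::lock_guard<std::mutex> lk(m);
        if (--pending == 0) cv.notify_all();  // last cleaner wakes the waiter
    }
    void wait_all() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return pending == 0; });  // no busy loop
    }
};

int condvar_demo() {
    CleanerSync sync;
    const int n = 4;
    sync.start(n);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&sync] { /* ... clean a volume ... */ sync.one_done(); });
    sync.wait_all();  // returns only after all n workers signalled completion
    for (auto& t : workers) t.join();
    return 0;
}
```

Because the flag is read and written only under the mutex, both the compiler optimization problem and the visibility problem disappear at once; the mutex provides the ordering the question asks about.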

If you're wedded to the busy loop, you could still reasonably have a single counter, such as std::atomic<size_t>, which is set by wakeup_cleaners and decremented as each cleaner completes. Then you can just wait for it to reach zero.
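The single-counter busy loop might look roughly like this (hypothetical names; the wake-up machinery is reduced to a plain store for the sketch):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Illustrative stand-in for the count of cleaners that still have work.
std::atomic<size_t> cleaners_pending{0};

size_t countdown_demo(size_t n) {
    cleaners_pending.store(n, std::memory_order_relaxed);  // "wakeup_cleaners"
    std::vector<std::thread> workers;
    for (size_t i = 0; i < n; ++i)
        workers.emplace_back([] {
            // ... clean a volume ...
            cleaners_pending.fetch_sub(1, std::memory_order_acq_rel);  // done
        });
    // Busy-wait until every cleaner has decremented the counter.
    while (cleaners_pending.load(std::memory_order_acquire) != 0)
        std::this_thread::yield();
    for (auto& t : workers) t.join();
    return cleaners_pending.load();
}
```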

If you really want a busy loop and really prefer to scan the array each time, it should be an array of std::atomic<bool>. That way you can decide what consistency you need from each load, and it will control both the compiler optimizations and the memory hardware appropriately.
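A sketch of that last option, with illustrative names standing in for `_requested_volumes` and the cleaner threads; the point is that an atomic load cannot be hoisted out of the polling loop, so the `-O3` infinite loop cannot happen:

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

constexpr int MAX_VOLS = 4;                 // stand-in for vol_m::MAX_VOLS
std::atomic<bool> requested[MAX_VOLS];      // stand-in for _requested_volumes

void cleaner(int vol) {
    // ... clean the volume ...
    requested[vol].store(false, std::memory_order_release);  // signal "done"
}

bool force_all_demo() {
    for (int i = 0; i < MAX_VOLS; ++i)
        requested[i].store(true, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (int i = 0; i < MAX_VOLS; ++i)
        workers.emplace_back(cleaner, i);

    bool remains = true;
    while (remains) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        remains = false;
        for (int vol = 0; vol < MAX_VOLS; ++vol) {
            // An acquire load must actually hit memory each iteration.
            if (requested[vol].load(std::memory_order_acquire)) {
                remains = true;
                break;
            }
        }
    }
    for (auto& t : workers) t.join();
    return true;  // the loop terminated: all flags were observed false
}
```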

Useless
  • Thanks for the clarification. Does this mean that the experts are only right when it comes to C++11? From what I understand, memory fences alone don't solve the problem. So how did people solve this problem with older C++? – Caetano Sauer Aug 11 '15 at 15:51
  • inline assembler, compiler/platform intrinsics or just using the available synchronization primitives (eg. semaphore) – Useless Aug 11 '15 at 16:04
  • Besides the options you mentioned, another option would be a rendezvous/cyclic barrier (`pthread_barrier_*`). – ninjalj Aug 11 '15 at 16:42

Apparently, volatile does what is necessary for your example. The topic of the volatile qualifier itself is too broad: you can start by searching for "C++ volatile vs atomic" etc. There are a lot of articles and questions and answers on the internet, e.g. Concurrency: Atomic and volatile in C++11 memory model. Briefly, volatile tells the compiler to disable some aggressive optimizations; in particular, to read the variable each time it is accessed (rather than storing it in a register or cache). There are compilers that do more, making volatile act more like std::atomic: see the Microsoft Specific section here. In your case, disabling an aggressive optimization is exactly what was necessary.

However, volatile doesn't define the ordering of the statements around it. That is why you need memory ordering in case you need to do something else with the data after the flags you check have been set. For inter-thread communication it is appropriate to use std::atomic; in particular, you should refactor _requested_volumes[vol] to be of type std::atomic<bool> or even std::atomic_flag: http://en.cppreference.com/w/cpp/atomic/atomic .
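The ordering point can be made concrete with a small sketch (illustrative names, not from the question): data written before a release store on an atomic flag is guaranteed visible after an acquire load observes that flag, which volatile alone does not guarantee.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // publication flag

int publish_and_read() {
    std::thread producer([] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish
    });
    while (!ready.load(std::memory_order_acquire))     // wait for publication
        std::this_thread::yield();
    int seen = payload;  // acquire pairs with release: guaranteed to see 42
    producer.join();
    return seen;
}
```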

An article that discourages usage of volatile and explains that volatile can be used only in rare special cases (connected with hardware I/O): https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt

Serge Rogatch
  • If so many people discourage the use of volatile, how would you solve the problem of avoiding the optimization? What I'm really trying to understand is whether proper synchronization (as you suggested with ```std::atomic```) would fix the problem without using volatile. In other words: are the two solutions orthogonal or does one include the other? – Caetano Sauer Aug 11 '15 at 15:36
  • Yes, `std::atomic` handles the undesirable optimization and the memory consistency _and_ the reordering problems. – Useless Aug 11 '15 at 15:48
  • When you write `volatile tells the compiler ... storing it in a register or cache`, how do you think the compiler will store something in a cache that is not a register? Utilizing registers is the job of the compiler, but how on earth does the compiler store something in the **cache**?! I would expect this resource to be owned by the CPU. – harper Aug 12 '15 at 05:51
  • @harper, the compiler may issue some special CPU instructions on some platforms. – Serge Rogatch Aug 12 '15 at 07:39
  • Can you pls provide an example for at least one platform where the compiler **stores** any value intentionally in a (specific) cache cell? Can you tell us what platform allows to *address* any cache cell(s)? – harper Aug 12 '15 at 07:53
  • @harper, e.g. PreFetchCacheLine macro from WinAPI: https://msdn.microsoft.com/en-us/library/windows/desktop/ms684826(v=vs.85).aspx . But I meant in my comment that for `volatile` compiler would rather prevent insertion of caching instructions or insert instructions that prevent caching for the memory address of the `volatile` variable. – Serge Rogatch Aug 12 '15 at 08:08
  • I doubt that `PreFetchCacheLine` will be inserted in the code without request from the source code. – harper Aug 12 '15 at 12:38
  • @harper, it's just an example, see the assembler instructions to which this macro unrolls. Compiler is not restricted from inserting those instructions unless `volatile` is specified for the variable. – Serge Rogatch Aug 12 '15 at 13:10