
There are questions on Stack Overflow with similar-sounding titles, but there is a critical difference between them and the question I'm asking (and not asking): see below in bold.

Say I have a time-consuming calculation. We'd like the fastest possible calculation and don't mind having several CPU cores spinning. I create a worker thread and hand half the calculation task to the worker.

The main thread sets an input variable for the worker to read, THEN sets the worker's output variable to a certain sentinel value.

The worker thread spins waiting for its output to be set to the sentinel, then reads the input, does the calculation, and overwrites the sentinel with the actual output.

Meanwhile, the main thread does its half of the calculation, then spins waiting for the output to change from the sentinel value it wrote, to any other value.

No race condition is possible, because only main sets the output variable to the sentinel, and only does so when it is non-sentinel. Only the worker sets the output variable to non-sentinel, and only does so when it is sentinel.

To end the child thread, the main thread sets a bool flag for the worker to exit, then sets the output variable to a sentinel. The worker sees the sentinel, checks the exit flag, and exits instead of doing the calculation.

I have made the input, output, and exit flag variables all atomic<>.

The question is: is this the fastest reliable way to communicate variables like double and bool between threads? In particular: do I in fact need to make all of input, output, and flag atomic? The software runs correctly over hundreds of millions of executions without anything being atomic, but I suspect all three really should be atomic. Is there a faster way on a modern Intel CPU? I'm especially concerned about pernicious side effects on the cache lines: should I ensure the variables being spun on are each on their own cache line?

(Also, the question is not: are spinning threads wasteful, is this a good general purpose solution for general purpose software, is this how a textbook would do it, are there more general-purpose ways to coordinate threads, etc.)

Nicol Bolas
Swiss Frank
    "*the question is not: are spinning threads wasteful*" If the answer to that question is "yes", wouldn't that necessarily mean that this isn't the "fastest" way to do what you're doing? – Nicol Bolas Apr 22 '23 at 18:42
    You need to make them atomic. Micromanaging cache-line and other attempts to circumvent Undefined Behaviour related to unsynchronized concurrent access are a wasted effort. In the end, the as-if rule can always break your program's logic when you don't follow the rules, and the rules require synchronization. For example a `while(flag){ ... }` loop can become `while(true) { ... }` if `flag` is not synchronized. Use `std::atomic`. This communicates your intentions to the compiler, and it can generate good instructions for the target platform. – François Andrieux Apr 22 '23 at 18:59
    Why are you trying to avoid `std::atomic` in the first place? With `memory_order` weaker than `seq_cst`, it compiles to about the same asm as plain operations on x86, except it can't optimize away stores/loads. On non-x86, only `relaxed` compile with no extra overhead or special instructions, but `acquire` and `release` are still cheap on AArch64. And you need at least acq/rel to make it safe to read non-atomic vars after seeing a flag update. – Peter Cordes Apr 22 '23 at 19:19
    @FrançoisAndrieux: "*You need to make them atomic.*" He did. From the post: "I have made the input, output, and exit flag variables all atomic<>." – Nicol Bolas Apr 22 '23 at 19:41
    I forgot you mentioned `atomic`. Unfortunately some compilers handle that inefficiently, even for operations that should be cheap like pure-load and pure-store. [C++20 std::atomic- std::atomic.specializations](https://stackoverflow.com/a/58945360) shows the less-efficient code-gen from GCC, bouncing through an integer reg. Clang gets it right. https://godbolt.org/z/qTEjqxM8v . But even with GCC, it's still pretty small overhead vs. inter-thread latency – Peter Cordes Apr 22 '23 at 20:52
  • You say this worked without `atomic`. Did you test with optimization enabled? In a debug build, everything gets spilled/reloaded between C++ statements, so it's somewhat similar to `volatile`, or to relaxed atomic. (Or on x86, to acq_rel atomic loads/stores, since compilers choose to align doubles by 8 bytes, and so on.) – Peter Cordes Apr 22 '23 at 20:55
  • *Why are you trying to avoid std::atomic in the first place?* -- I'm not trying to avoid it. I am honestly asking whether a solution with them would be fastest. You suggest so! I've learned to take your word as bible on such matters. Without using `memory_order` I saw the atomic performance at more like 80ns per signal in contrast to 35ns. But I will try memory_order. – Swiss Frank Apr 23 '23 at 19:46
  • *If the answer to that question is "yes", wouldn't that necessarily mean that this isn't the "fastest" way to do what you're doing?* It depends. Some software is run on dedicated computers at least for dedicated time periods and if you've paid for cores they can be used nearly for free. In this case a 10-thread solution that's only 5% faster on the wall clock than the single-thread solution might be very desirable. – Swiss Frank Apr 23 '23 at 19:48
  • *He did.* -- Thanks @NicolBolas! François is right though: I *was* asking, and he confirmed what I was doing was indeed necessary. – Swiss Frank Apr 23 '23 at 19:50
  • *Unfortunately some compilers handle (`atomic`) inefficiently* -- yes, I went down that rabbit hole with GCC. To my recollection of my understanding at the time, 3-4 years ago, for a couple of GCC versions I was otherwise interested in there was no combination of compiler flags that would generate a single instruction for 8-byte atomics. Or maybe I'm thinking specifically of CAS?? I think we switched to clang at that point anyway, and not only for that reason. – Swiss Frank Apr 23 '23 at 19:54
  • *You say this worked without atomic. Did you test with optimization enabled?* -- It worked absolutely for sure with optimization, but I was pretty convinced even when asking this question that it was just coincidentally working. – Swiss Frank Apr 23 '23 at 19:55

1 Answer


Based on your description, you would need to make all three variables atomic to guarantee reliable operation.

Race Conditions

Contrary to what you seem to believe, you do currently have data races, at least as that term is defined with respect to C++ (§[intro.races]):

Two expression evaluations conflict if one of them modifies a memory location (6.7.1) and the other one reads or modifies the same memory location.

[ ... ]

Two actions are potentially concurrent if

  • they are performed by different threads, or [ ... ]

The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.

I'm not going to quote the whole definition of what it means for one action to "happen before" another, but the short summary is that you need something like a mutex or an atomic variable to establish that relationship when the actions happen in different threads.

Optimization

Without code to test with, it's essentially impossible to be sure, but my general experience has been that when you want to communicate a flag between threads, it's generally faster (or at least as fast, anyway) to use an atomic bool rather than something like a floating point number.

So, rather than the main thread setting the input, then setting the output to a sentinel value to start the worker thread, I'd have a dedicated atomic bool specifically to say the input is ready.

Likewise, I'd use dedicated atomic bools specifically to signal that the child thread should exit, and that the output from the child thread is ready.

The risk with doing otherwise is that it's entirely possible for something like an std::atomic<double> to be implemented by protecting a double with a mutex (in which case, every load or store will have to lock the mutex, possibly involving a switch to kernel mode and back).

It's possible that the same is true of an atomic bool, but I've never seen such an implementation. There's a set of preprocessor macros that will tell you whether this is the case for any integer or pointer type (§[atomics.syn]):

ATOMIC_BOOL_LOCK_FREE
ATOMIC_CHAR_LOCK_FREE
ATOMIC_CHAR8_T_LOCK_FREE
ATOMIC_CHAR16_T_LOCK_FREE
ATOMIC_CHAR32_T_LOCK_FREE
ATOMIC_WCHAR_T_LOCK_FREE
ATOMIC_SHORT_LOCK_FREE
ATOMIC_INT_LOCK_FREE
ATOMIC_LONG_LOCK_FREE
ATOMIC_LLONG_LOCK_FREE
ATOMIC_POINTER_LOCK_FREE

But it's essentially certain that bool will be lock-free if anything is. double might be lock-free, but it's less likely.

So, if it were up to me, I'd use something along this general line:

std::atomic<bool> inputReady;
std::atomic<double> inputValue;

std::atomic<bool> outputReady;
std::atomic<double> outputValue;

// parent:
outputReady = false;

inputValue = 1234.567;
inputReady = true;

// do computation

while (!outputReady)
    ;
use(outputValue);

// child thread:
while (!inputReady)
    ;

double in = inputValue;
outputValue = computeResultFrom(in);
outputReady = true;

The signal to kill the child thread could be somewhat similar, but my immediate advice would be to use a std::jthread, and use its built-in stop token for that job.

Summary

Using dedicated atomic<bool>s for signaling may or may not be faster, but it almost certainly won't be any slower, and it's considerably more readable.

Jerry Coffin
    Good compilers (e.g. GCC and clang) have lock-free `atomic`, at least when targeting Linux. IDK about Windows; I assume they're ABI-compatible with whatever MSVC decided to do. https://godbolt.org/z/qTEjqxM8v shows the asm. You can check at compile time with `std::atomic::is_always_lock_free`. https://en.cppreference.com/w/cpp/atomic/atomic/is_always_lock_free . GCC has some inefficiency loading an atomic double, so you might consider using a flag just to say the payload is ready. (And yeah, pure spinning without `.wait` / `.notify_one` if you value latency over throughput) – Peter Cordes Apr 22 '23 at 20:57
  • *pure spinning without `.wait` / `.notify_one` if you value latency over throughput* -- do you have a moment to expand on that, @PeterCordes? – Swiss Frank Apr 23 '23 at 20:02
    @SwissFrank: basically, if you just have two threads involved, and can dedicate a core to running each, when one is waiting for another to set a value, it'll detect that change faster by constantly polling for it, so it can react immediately after it changes. In the other direction, if you have other threads that could run instead, as soon as a thread is blocked, you want to put it to sleep and run some other thread that's not blocked. But that may not wake up again as quickly after the event happens that would unblock it (e.g another thread might finish its time-slice first). – Jerry Coffin Apr 23 '23 at 20:37