
(EDIT: Just to make it clear: The question of "cache coherence" is in the case that there is no use of atomic variables.)

Is it possible (single-CPU case: Windows can run on top of Intel / AMD / Arm CPUs) that thread-1, running on core-1, stores a bool variable (for example) that stays in core-1's L1 cache, while thread-2, running on core-n, reads that variable and sees another, stale copy of it in memory?

Code example (to demonstrate the issue, let's say that the std::atomic_bool is just a plain bool):

#include <thread>
#include <atomic>
#include <chrono>

std::atomic_bool g_exit{ false }, g_exited{ false };

using namespace std::chrono_literals;

void fn()
{
    while (!g_exit.load(std::memory_order_acquire))
    {
    // do something (let's say it takes 1-4s, repeatedly)
        std::this_thread::sleep_for(1s);
    }

    g_exited.store(true, std::memory_order_release);
}

int main()
{
    std::thread wt(fn);
    wt.detach();

    // do something (let's say it took 2s)
    std::this_thread::sleep_for(2s);

    // Exit

    g_exit.store(true, std::memory_order_release);

    for (int i = 0; i < 5; i++) { // Timeout: 5 seconds.
        std::this_thread::sleep_for(1s);
        if (g_exited.load(std::memory_order_acquire)) {
            break;
        }
    }
}
Amit
  • `std::memory_order_relaxed` gives you basically no guarantees on order of operations. It only makes sure the operation happens atomically. Are you sure that's what you want? – NathanOliver Mar 21 '22 at 15:29
  • You should probably read up on how cpu caches work. Specifically cache invalidation – Taekahn Mar 21 '22 at 15:58
  • 2
    It's a [common misconception](https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/) that memory ordering has something to do with cache coherency or failure thereof. In fact, modern CPUs invariably have coherent caches, and C++ more or less guarantees it. Once your reads and writes make it to L1 cache, everything is good and they're globally visible. The reason for memory ordering issues is that your reads and writes may reach L1 cache sooner or later than you expected, and not necessarily in the same order that you coded them. – Nate Eldredge Mar 22 '22 at 07:20
  • @NateEldredge, (let's put aside the memory ordering) in case the program above uses a plain bool, is there an issue with "cache coherency"? I tend to assume that the answer is: yes, there is an issue. – Amit Mar 22 '22 at 07:46
  • 2
    There is definitely an issue if you use plain `bool`; it is a data race and the C++ standard says it has undefined behavior. However, the issue does not have anything to do with cache coherency. Most CPUs would already be able to load and store a `bool` atomically with ordinary load and store instructions. So for this particular program, the main issue is that without `atomic`, the compiler may perform optimizations that would break it, as Peter's answer describes. – Nate Eldredge Mar 22 '22 at 14:31
  • With relaxed order, there is no happens-before relationship; different threads can see different orders. – Hui Mar 27 '22 at 22:40

1 Answer


CPU cache is always coherent across the cores that we run C++ threads across1, whether they're in the same package (a multi-core CPU) and/or spread across sockets with an interconnect. That makes it impossible to load a stale value once the writing thread's store has executed and committed to cache. As part of committing, the writing core sends an invalidate request to all other caches in the system.

Other threads can always eventually see your updates to std::atomic vars, even with mo_relaxed. That's the entire point; std::atomic would be useless if it didn't work for this. ("Eventually" is often about 40 nanoseconds of inter-thread latency; relaxed isn't worse for this, it just doesn't stall execution of later memory operations until the store is visible to other threads, the way seq_cst has to on most ISAs. Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? - no, or not significantly.)


But without std::atomic, your code would be thoroughly broken - a classic mistake, especially in MCU programming (see C++ O2 optimization breaks while loop and Multithreading program stuck in optimized mode but runs normally in -O0). The compiler can assume that no other thread is writing a non-atomic var it's reading, so it can hoist the actual load out of the loop and keep the value in a thread-private CPU register. It's then not re-reading from coherent cache at all; i.e. while(!exit_now){} becomes if(!exit_now) while(1){} for a plain bool exit_now global.

Registers are thread-private and not coherent in any way, so code written with plain int or bool can break this way even on a uniprocessor system. Context switches just save/restore registers to thread-private kernel buffers; the kernel doesn't know what the code was using registers for, so a switch will never create the effect of re-reading bool g_exit from memory into the thread's register. In fact, the code might not even be re-checking a register at all after optimizing while(!non_atomic_flag){} into if(!non_atomic_flag) while(42){}.

(Except that your sleep_for call would probably prevent that optimization. It's probably not declared pure, because you don't want compilers to optimize away multiple calls to it; the passage of time is its side effect. So the compiler has to assume that calls to it could modify global vars, and thus has to re-read the global from memory (with normal load instructions that go through coherent cache).)

Also related: Why set the stop flag using `memory_order_seq_cst`, if you check it with `memory_order_relaxed`?


Footnote 1: C++ implementations that support std::thread only run threads across cores in the same coherency domain. In almost all systems there is only one coherency domain, including all cores in all sockets, but huge clusters with non-coherent shared memory between nodes are possible.

So are embedded boards with an ARM microcontroller core sharing memory but not coherent with an ARM DSP core. You wouldn't be running a single OS across both those cores, and you wouldn't consider code running on those different cores part of the same C++ program.

For more details about cache coherency, see When to use volatile with multi threading?

Peter Cordes