
I have two processes, p1 and p2, each running on a different core, say c1 and c2 (both cores are on the same physical processor). Each core has its own private L1 and L2 caches, while the L3 cache is shared. Both p1 and p2 use a pointer `ptr` that lives in shared memory: p1 initializes `ptr`, and p2 is supposed to simply use it. I am facing a crash in p2 because it initially sees `ptr` as NULL (though after some time, possibly because of cache coherence, p2 sees the correct value of `ptr`). A minimal sketch of the setup is shown after the questions below. I have the following questions related to this:

  1. How can the above situation arise (p2 seeing a NULL value of `ptr`), given that some form of cache-coherency protocol must be in use?
  2. In a shared-bus/memory architecture, different processors (on different sockets) usually follow bus-snooping protocols for cache coherence. What cache-coherence protocols are used in the case of two cores (both on the same physical processor) that have private L1/L2 caches while sharing a common L3 cache?
  3. Is there a way to check which cache-coherence protocol is being used (this is on an Ubuntu 16.04 system)?
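
A minimal sketch of the kind of setup I mean (not my exact code; the names, the use of `fork`, and `MAP_ANONYMOUS` here are only for illustration):

```c
/* Sketch only: a plain (non-atomic) pointer in mmap'd shared memory,
 * written by one process and read by the other with no synchronization,
 * so the reader may see NULL. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared {
    int *ptr;                   /* p1 initializes this; p2 just uses it */
};

int value = 42;

int main(void) {
    /* Anonymous shared mapping: zero-filled, so ptr starts out NULL. */
    struct shared *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shm == MAP_FAILED)
        return 1;

    if (fork() == 0) {                       /* p2: reader */
        printf("p2 sees ptr = %p\n", (void *)shm->ptr);  /* may print (nil) */
        return 0;
    }
    shm->ptr = &value;                       /* p1: writer */
    return 0;
}
```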
mezda
  • And you are sure that p1 has initialized the ptr before p2 reads it? How is the memory shared? – Erki Aring Apr 17 '20 at 16:17
  • From the debug logs, I can see that p1 has initialized `ptr`. p2 prints NULL the first time it tries to access it, and from the second time onwards it prints the correct value of `ptr`. The memory is shared using mmap. – mezda Apr 17 '20 at 16:29

1 Answer


x86 is cache-coherent even across multiple sockets (like all other real-world ISAs that you can run std::thread across). x86's memory-ordering model is program-order + a store-buffer with store forwarding.

Formal model: *A Better x86 Memory Model: x86-TSO*. Informally: http://preshing.com/20120930/weak-vs-strong-memory-models/

Lack of coherence is definitely not your bug. Once a store commits to L1d cache in one core, no other core can load the old value. (Because their copies of the line have all been invalidated so the core doing the modification can have exclusive ownership: MESI.)


Almost certainly p2 is reading the shared memory before p1 writes it. Coherence doesn't create synchronization on its own. If p1 and p2 both attach to the shared memory asynchronously, nothing stops p2 from reading before p1 writes.

You need some kind of data-ready flag which p2 checks with std::memory_order_acquire before reading the pointer. Or just spin on loading the pointer until you see a non-NULL value.

(Use mo_acquire on an atomic load of the pointer to avoid compile-time reordering, or runtime reordering on non-x86, with stuff you access later using that pointer. Or really only mo_consume would be needed for using a pointer, but compilers strengthen that to mo_acquire. That's fine on x86; acquire is free anyway.)
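
In C11 terms (the OP says they're using C), a minimal sketch of that data-ready pattern, assuming the pointer lives in the mmap'd shared region as an `_Atomic` object (names here are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Layout of the mmap'd shared region; both processes map the same
 * physical memory here. A NULL pointer doubles as the "not ready" state. */
struct shared {
    _Atomic(int *) ptr;
};

/* Writer (p1): publish the pointer. The release store guarantees that
 * everything p1 wrote before it is visible to a reader that acquires it. */
void publish(struct shared *shm, int *p) {
    atomic_store_explicit(&shm->ptr, p, memory_order_release);
}

/* Reader (p2): spin until a non-NULL value is seen. The acquire load
 * orders p2's later accesses through the pointer after this load.
 * (On x86 the acquire is free; it still stops compile-time reordering.) */
int *wait_for_ptr(struct shared *shm) {
    int *p;
    while ((p = atomic_load_explicit(&shm->ptr, memory_order_acquire)) == NULL)
        ;  /* in real code, consider sleeping/yielding instead of pure spin */
    return p;
}
```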

Peter Cordes
  • Thanks @PeterCordes for the info. I had tested it multiple times, and the behaviour is always that p1 has written `ptr` first. Later, p2 tries to access it and sees the value as NULL (though the correct value is reflected after some time). Is that even possible in theory? Are there any known drawbacks of cache-coherence protocols? Is there a way I can check the cache-coherency protocol being used on my system (Ubuntu 16.04)? – mezda Apr 17 '20 at 19:44
  • @mezda: No, that's not possible. Whatever you're using to determine that p1 has written `ptr` before p2 accesses it is incorrect, if they're both accessing the same physical memory (even via different virtual addresses from different processes). **The definition of p1's store happening before p2's load is that p2's load sees it.** – Peter Cordes Apr 17 '20 at 19:49
  • (If you're looking at the TSC or something on each core, remember that after executing a store locally, it doesn't become globally visible right away. It takes several clock cycles for it to retire and commit to L1d. But once it does, nothing else can load a stale value.) If you want more help debugging this, you're going to need a [mcve] that shows a simplified version of what your code is doing, but still shows the effect you describe. – Peter Cordes Apr 17 '20 at 19:54
  • @mezda: *Are there any known drawbacks of cache-coherence protocols?* - they cost power and performance for cases when different cores are accessing different memory. NT stores can avoid doing an RFO before writing a full line, but that evicts the line from cache. However, a multi-core system without cache coherence would be nearly un-programmable for multi-threaded code, requiring explicit flushing every time you want other cores to be able to see what you store. Non-coherent shared memory in a cluster of machines running separate kernels is normally used for MPI message passing. – Peter Cordes Apr 17 '20 at 19:57
  • GPUs don't always have coherent shared memory, especially not coherent with the CPU's view of memory. GPUs normally work on separate data. Also related: https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ / [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) / [Does a memory barrier ensure that the cache coherence has been completed?](https://stackoverflow.com/q/42746793) for much more about what cache coherence means, and memory ordering. (It doesn't mean seq_cst). @mezda – Peter Cordes Apr 17 '20 at 19:58
  • With respect to your comment `Whatever you're using to determine that p1 has written ptr before p2 accesses it is incorrect`: I have put additional debug logs to print the value of `ptr` both when p1 writes it and when p2 tries to access it. Since you say it may take several clock cycles for the store to commit to the L1 cache, I think that is the reason `ptr` is seen as NULL by p2. But I am printing the value of `ptr` in p1 as well; in that case, is it possible that even though the value has not yet been written to the L1 cache, p1 is able to print the correct value of `ptr` from registers? – mezda Apr 17 '20 at 20:11
  • @mezda: Printing is very slow, requires locking, and requires a value *as a function arg*, not read from memory at the moment of printing after `stdio` locking establishes an order. (Or for separate processes, not threads in one process, `printf` has already formatted a value into a string and sent it to the kernel with `write` before locking in the kernel happens to decide which `write` system call copies its buffer onto the TTY fd first). This has zero connection to the ordering of writes and reads to some other shared memory. – Peter Cordes Apr 17 '20 at 20:22
  • @mezda: Probably what happens is that p2 loads either before p1's store even executes locally, or at least before receiving an RFO from p1 taking exclusive ownership of that line. When p1 eventually reloads its own stored value (forwarded from the store buffer if you used memory ordering of `std::memory_order_release` or weaker for the store, or from cache if you used the default `seq_cst`, so it had to mfence), it of course sees what it stored. But if p2 has any cache misses or anything before printf can actually write() a string, p1 could have already made its write syscall. – Peter Cordes Apr 17 '20 at 20:26
  • @mezda: You are using C++ std::atomic or C11 `<stdatomic.h>`, right? Or writing assembly by hand? Anyway, if p1 has already printed something so stdio code is already "warmed up", but p2 hasn't, it's easy for p1 to win that race after the loads happen. Lazy dynamic linking, cache misses, or branch mispredictions, in p2 in user-space or in kernel code, could delay it and make its debug log message appear after p1's, regardless of the ordering of the memory accesses. **The values in the messages tell you the order in which the load and store effectively happened.** – Peter Cordes Apr 17 '20 at 20:30
  • @mezda: If you want an analogy, think of relativity: the meaning of "simultaneous" gets muddled when you can't truly observe both things at the same time. You can only sort out what happened from the values you see. (Unlike relativity, x86 guarantees that all observers can agree on a total order for all stores, though. Some machines don't have this guarantee, and you can have 2 different reader threads that read 2 vars each disagree on the order of stores done by 2 different writer threads that write one var each: IRIW reordering.) – Peter Cordes Apr 17 '20 at 20:32
  • Thanks for your answers. I am using C and am not very familiar with C++, so I could only partially understand what you said above. When you say `When p1 eventually reloads its own stored value (forwarded from the store buffer)`, does the store buffer refer to the CPU registers (because that is the only layer of memory between a core and the L1 cache)? Though it is a repetition, I just want to understand whether a print (in the debug logs) can happen from the CPU registers themselves (even though the value has not yet been stored in the L1 cache)? – mezda Apr 18 '20 at 14:20
  • @mezda: [C11 `<stdatomic.h>`](https://en.cppreference.com/w/c/atomic) uses identical terminology with the same meaning as C++11 `std::atomic`, just without the `std::`. e.g. `atomic_store_explicit(&shared_var, 123, memory_order_release)` compiles to just a plain `mov` store, not an `xchg` or `mov+mfence` like you'd get from `memory_order_seq_cst`, which is the default for `shared_var = 123;` (see the codegen sketch after this thread). – Peter Cordes Apr 18 '20 at 16:44
  • The "store buffer" and other details aren't specified by C++, they're just how real CPUs work. It's not registers; it's a queue that decouples execution from cache access, e.g. to stop cache-miss stores from stalling execution. [Size of store buffers on Intel hardware? What exactly is a store buffer?](https://stackoverflow.com/q/54876208) If you use `shared_var = 123;` then that's a seq_cst store that waited for the store buffer to drain before reloading, so no, `printf("%d\n", shared_var)` can't have store-forwarded the value from the store buffer before it became globally visible. – Peter Cordes Apr 18 '20 at 16:44
  • @mezda: But if you used `atomic_store_explicit(&shared_var, 123, memory_order_release);` then yes, the reload by `local_tmp = shared_var;` can happen by *store forwarding* from the store buffer: a thread can see its own stores before they become globally visible, for orders weaker than seq_cst. **The key point in what I explained earlier though is that `foo(shared_var)` reads `shared_var` into a local temporary *before* calling.** The "race" to print first is a separate race from the race to store / load first. – Peter Cordes Apr 18 '20 at 16:50
  • I am using a simple assignment, i.e. `ptr = val;`, in process p1 to set the value of `ptr`. From your replies, it seems that printf can print the value from the store buffer as well (even though it's not yet saved in the L1 cache). One last question: what is the way to check the cache-coherency protocol being used on my Ubuntu 16.04 system? – mezda Apr 21 '20 at 18:16
  • @mezda: Oh, with plain `int*ptr` or whatever then yes, the compiler will assume that the value of `ptr` isn't changed by anything else ([because that would be data-race UB](//stackoverflow.com/q/58516052/)) so it can optimize `printf(.., ptr)` into `printf(.., val)`. You keep saying *printf* can print from whatever location, but that's not at all what's happening. It's *your code which calls printf* that has to get a value in a register (as an arg to printf). Remember that `printf` is call-by-value, not by reference, so it's not printf itself loading from the shared memory. – Peter Cordes Apr 21 '20 at 18:23
  • Thanks for the info. Is there a way to check the cache-coherency protocol being used in a Linux system? – mezda Apr 21 '20 at 18:26
  • @mezda: *One last question: what is the way to check the cache-coherency protocol being used on my Ubuntu 16.04 system?* - It might be possible to design a microbenchmark that could detect the performance difference between Intel's MESIF cache coherency vs. AMD's MOESI cache coherency (transferring dirty data directly between cores instead of writing back to L3). [What is the benefit of the MOESI cache coherency protocol over MESI?](https://stackoverflow.com/q/49983405). As far as verifying correctness: you're already running Linux; the kernel would break if your CPU didn't work. – Peter Cordes Apr 21 '20 at 18:28
  • Thanks a lot for all the info. Really appreciate all your quick and informative replies. – mezda Apr 21 '20 at 18:30
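
To make the codegen point from the comment thread concrete, a sketch of the two store flavours discussed above (what a typical x86-64 compiler emits; exact instruction choice varies by compiler, and `shared_var` is just an illustrative name):

```c
#include <stdatomic.h>

_Atomic int shared_var;

void store_release(void) {
    /* Typically compiles to a plain `mov` on x86-64. The store can sit
     * in the store buffer, so this thread's own later loads may be
     * satisfied by store forwarding before it's globally visible. */
    atomic_store_explicit(&shared_var, 123, memory_order_release);
}

void store_seq_cst(void) {
    /* Plain assignment to an _Atomic object defaults to seq_cst.
     * Typically compiles to `xchg` (or `mov` + `mfence`), which drains
     * the store buffer before this thread's later loads can execute. */
    shared_var = 123;
}
```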