Cache coherence state machine

Question

Let's say we have a multi-core machine. Core 1 tries to write to variable X. It doesn't have that variable's cache line in its L1d cache, so it broadcasts an RFO. In the mean time, it writes the store into the store buffer in core 1, as it hasn't yet received acks from every core for exclusive ownership of the cache line containing X.

Now Core 1 itself receives a request for ownership of the cache line containing X.

(Imagine that core 2 is now trying to modify variable X; maybe X is a shared counter that each thread updates with `x += local_count`.)
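For concreteness, the access pattern I have in mind looks roughly like the sketch below. It assumes the counter is a `std::atomic<long>`; the names `worker` and `run_shared_counter` and their parameters are made up for illustration. It only shows the software side that causes several cores to request ownership of the same line, not how the hardware resolves the race.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical shared counter. Every core that updates it needs the
// containing cache line in an exclusive state, which is what triggers RFOs.
std::atomic<long> x{0};

// Each thread counts privately, then publishes with x += local_count.
void worker(int rounds, long work_per_round) {
    for (int r = 0; r < rounds; ++r) {
        long local_count = 0;
        for (long i = 0; i < work_per_round; ++i) {
            ++local_count;   // private to this thread, no sharing involved
        }
        x += local_count;    // atomic RMW on the line all threads contend for
    }
}

// Starts num_threads workers, waits for them, and returns the final count.
long run_shared_counter(int num_threads, int rounds, long work_per_round) {
    x.store(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, rounds, work_per_round);
    }
    for (auto& th : threads) {
        th.join();
    }
    return x.load();
}
```

Calling `run_shared_counter(4, 10, 1000)` runs four threads that each perform ten `x += local_count` updates, so at any moment more than one core may want the line in exclusive state.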

Will the core ack this request and store it in the invalidation queue or will the core not respond to it since it is still in the middle of a phase transition to exclusive mode? I feel like resolving either of the scenarios might involve some overhead and bookkeeping unless I am missing something very basic.

I really don't think the C++ tag is appropriate. This is not even tangentially related to the C++ language. — Nate Eldredge, Nov 29 '21 at 03:17
*In the mean time, it writes the store into the store buffer in core 1* - That would actually happen *first*, before the RFO. ([Can a speculatively executed CPU branch contain opcodes that access RAM?](https://stackoverflow.com/q/64141366)). Sending RFOs would typically be done as a store instruction retires (becomes non-speculative) or as it approaches the head of the store buffer, close to committing. (Maybe related [Why doesn't RFO after retirement break memory ordering?](//stackoverflow.com/q/62376976), although I think there was a more specific discussion about when RFOs are started) — Peter Cordes, Nov 29 '21 at 04:07
(That point about when RFOs are sent vs. when OoO exec writes address and data into the store buffer doesn't invalidate your question or the premise, just a minor nitpick in description). — Peter Cordes, Nov 29 '21 at 04:15
Intel CPUs have some kind of contention manager IIRC, to give some fairness in which core gets ownership next when many are hammering on the same line with atomic RMWs. Since modern CPUs actually use directory-based coherence (e.g. using inclusive L3 tags on Intel since Nehalem, except for recent server chips with a mesh interconnect and non-inclusive L3), I guess there's some queueing of requests there, not just in individual cores? I'm not sure on the details, but it does at least work as a snoop filter so all cores aren't flooded with RFOs from each other. — Peter Cordes, Nov 29 '21 at 04:23


0 Answers