
Let's say my server program has two threads (T1 and T2) running on separate cores. Both are serving RPCs coming in over the network from a single external client. The following sequence of operations occurs:

  1. In-memory variable foo is initialized to zero
  2. Client sends RPC, which happens to be served by T1, to set foo to 42
  3. T1 writes the value to foo; the write is cached in its core's L1 (not yet in main memory)
  4. T1 sends ACK to client
  5. Client sends RPC, which happens to be served by T2, to read foo
  6. T2 reads foo from its cache or main memory and sees that it is zero
  7. T2 replies to the client saying foo is zero.

This violates external consistency.

Can this actually occur, or is there an implicit flush of T1's cache when it performs the I/O of sending the ACK back to the client (step 4)?
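
For concreteness, here is a minimal single-process sketch of the sequence. It is only a model: the `acked` flag stands in for "the client received the ACK and only then sent the read RPC", and the release/acquire pair models whatever ordering the ACK path provides, which is exactly what is in question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int foo = 0;                     // step 1: in-memory variable, initialized to zero
std::atomic<bool> acked{false};  // stand-in for "the client has received the ACK"

int main() {
    std::thread t1([] {                                    // T1 serving the write RPC
        foo = 42;                                          // step 3: plain store
        acked.store(true, std::memory_order_release);      // step 4: send ACK
    });
    std::thread t2([] {                                    // T2 serving the read RPC
        while (!acked.load(std::memory_order_acquire)) {}  // step 5: read RPC arrives only after the ACK
        assert(foo == 42);                                 // step 6: does T2 see 42 or 0?
    });
    t1.join();
    t2.join();
}
```
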

vsekhar
  • On x86 and x64, the caches are coherent and you'll get 42 on T2 even if the cores don't share the same cache unit. – Non-maskable Interrupt Jan 09 '18 at 02:34
  • *Client sends RPC* - I think any RPC implementation uses locked instructions. As a result we have `T1: mov [foo],42; lock *` and later `T2: lock *; mov r,[foo]`. Then it's simply the x86 rules: Locked Instructions Have a Total Order, and Loads and Stores Are Not Reordered with Locks. – RbMm Jan 09 '18 at 15:22
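
A hedged sketch of what that comment describes, assuming (hypothetically) that the RPC library's reply path takes an internal mutex; on x86 the mutex acquire/release compiles to lock-prefixed instructions, which are what impose the ordering:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

int foo = 0;
std::mutex reply_mutex;  // hypothetical internal lock of the RPC library's reply path

void handle_set_rpc() {                          // runs on T1
    foo = 42;                                    //   mov [foo], 42
    std::lock_guard<std::mutex> g(reply_mutex);  //   lock-prefixed RMW under the hood
    // ... enqueue the ACK to the client ...
}

int handle_get_rpc() {                           // runs on T2
    std::lock_guard<std::mutex> g(reply_mutex);  //   another lock-prefixed RMW
    return foo;                                  //   mov r, [foo] -- sees 42
}

int main() {
    std::thread t1(handle_set_rpc);
    t1.join();                                               // models the client waiting for the ACK
    std::thread t2([] { assert(handle_get_rpc() == 42); });  // read RPC is sent only afterwards
    t2.join();
}
```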

3 Answers


On x86 and x64 series, all caches are coherent and you'll get 42 on T2 even if the two threads do not share the same cache unit.

Your thought experiment can be reduced to two cases: the two threads share the same cache unit (multi-core) or they do not (multi-CPU).

When they share a cache unit, both T1 and T2 use the same cache, so they will both see 42 without any synchronization to memory.

When the caches are not shared (i.e., multi-CPU), the architecture requires the cache units to be kept coherent, and that is transparent to software. Both threads will see 42 at the same address. This coherency traffic introduces some overhead, though, which is one reason multi-core designs are preferred nowadays (besides caches being expensive).
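
A minimal sketch of that transparency: the `relaxed` atomics below are only there to make the cross-thread access well-defined in C++; no fence, flush, or lock is issued, yet the store still becomes visible via the coherency protocol.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> foo{0};

int main() {
    std::thread writer([] { foo.store(42, std::memory_order_relaxed); });
    std::thread reader([] {
        int v;
        // Spins until the coherency protocol delivers the new value to this core.
        while ((v = foo.load(std::memory_order_relaxed)) == 0) {}
        std::printf("reader saw %d\n", v);  // prints 42
    });
    writer.join();
    reader.join();
}
```
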

Non-maskable Interrupt
  • Each core in a multi-core CPU has its own private L1 cache. Modern Intel CPUs have private L2 caches as well. It's only the last-level cache that's shared between cores. The fundamental difference is between hyperthreads on the same core ([which do share an L1D but not a store queue](https://stackoverflow.com/questions/32979067)) vs. separate cores with [MESI coherency](https://en.wikipedia.org/wiki/MESI_protocol) between their caches, whether that coherency traffic has to go across an external bus between sockets or not. – Peter Cordes Jan 09 '18 at 03:29
  • @PeterCordes Thanks for the note, I edited for clarity. I was trying to stay at an overview level and treat the cache as a whole for easier understanding. But yes, the cache hierarchy is far more complex and has different levels as well. – Non-maskable Interrupt Jan 09 '18 at 12:16

At step 3, before T1 modifies the value, it acquires the cache line as "exclusive", meaning that it is not present in any other threads' caches, and sets the cache line state to "modified".

At step 6, T2 does not have the value in its cache, so when it goes to get the value, the cache coherency protocol finds the modified line in T1's cache. The state of the cache line is set to "shared" in both T1's cache and T2's cache.
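
To make those transitions concrete, here is a toy model of the two steps described above (only an illustration of the state changes, not a real coherency implementation):

```cpp
#include <cassert>

enum class State { Modified, Exclusive, Shared, Invalid };

struct Line { State state = State::Invalid; int data = 0; };

// Step 3: T1's core invalidates other copies, takes the line Exclusive, then writes.
void write_line(Line& writer, Line& other, int value) {
    other.state  = State::Invalid;    // no other cache may keep a copy
    writer.state = State::Exclusive;  // line owned exclusively
    writer.data  = value;
    writer.state = State::Modified;   // now dirty with respect to memory
}

// Step 6: T2's core misses; the coherency protocol finds the Modified copy.
int read_line(Line& reader, Line& owner) {
    if (owner.state == State::Modified) {
        reader.data  = owner.data;    // data supplied by the owning cache
        owner.state  = State::Shared; // both copies end up Shared
        reader.state = State::Shared;
    }
    return reader.data;
}

int main() {
    Line t1_cache, t2_cache;
    write_line(t1_cache, t2_cache, 42);
    assert(read_line(t2_cache, t1_cache) == 42);
    assert(t1_cache.state == State::Shared && t2_cache.state == State::Shared);
}
```
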

prl

On x86, the caches are kept coherent to prevent exactly this kind of problem.

The first part of this is that each cache keeps track of the state of each line it holds.¹ If (to use your example) a single piece of data is simultaneously held in two caches and one CPU writes to it, the writing CPU sets its cache line to the "modified" state and sends a signal to the other CPU telling it to set its copy of that line to the "invalid" state.

The second part of the puzzle is that each CPU "snoops" all memory transactions (by other CPUs or by bus-mastering PCI devices) so it "sees" when somebody else is trying to read data that's in its cache. When that happens, it forces a pause in that transaction, writes the data from its cache out to memory, then lets the transaction proceed after the data's been written so it'll get the current data.


  1. The classic set of states is Modified, Exclusive, Shared, and Invalid (MESI). Most modern CPUs add at least one more state (often "Owned", giving MOESI), and some add still more. Virtually all include at least Modified and Invalid, though.
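
A small sketch of the snoop-and-write-back behaviour described above (whether the data actually goes through memory or is forwarded cache-to-cache is an implementation detail; this just models the description):

```cpp
#include <cassert>

enum class MESI { Modified, Exclusive, Shared, Invalid };

struct CacheLine { MESI state; int data; };
struct Memory    { int value = 0; };

// Another agent's read appears on the bus: a cache holding the line Modified
// pauses the transaction, writes the line back, then lets the read proceed.
int snooped_read(Memory& mem, CacheLine& snooper) {
    if (snooper.state == MESI::Modified) {
        mem.value     = snooper.data;   // forced write-back
        snooper.state = MESI::Shared;   // line is no longer dirty
    }
    return mem.value;                   // reader now gets the current data
}

int main() {
    Memory mem;                                // foo is still 0 in main memory
    CacheLine dirty{MESI::Modified, 42};       // the writing core's cached copy
    assert(snooped_read(mem, dirty) == 42);    // the other CPU's read sees 42
    assert(mem.value == 42);
}
```
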
Jerry Coffin
  • Cache-coherent DMA is a more recent development (maybe as recent as the last 10 years, when memory controllers were moved on-die, i.e. the integrated northbridge). Before that, x86 did have to explicitly flush cache lines before using them as the source or destination for DMA transfers, otherwise cache writeback could step on the data that was just DMAed there. But caches in other CPUs have always been coherent, so it's a mistake to talk about DMA as similar to writes by other CPUs, unless you're talking about very special more-than-8-socket machines with (some) non-coherent memory. – Peter Cordes Jan 09 '18 at 03:14
  • For normal multi-socket systems, [MESI](https://en.wikipedia.org/wiki/MESI_protocol) operates between sockets as well as between cores in the same socket. They're all part of the same coherency domain, so if one core has a Modified copy of a line in its L1D or L2 cache, no other core can have it in any state but Invalid. Thus there's no need to snoop traffic from other cores. (I'm not sure exactly how NT stores fit in. They avoid needing an RFO to get exclusive access to the line before it flushes, but perhaps as part of the store it does get other copies to discard instead of write-back). – Peter Cordes Jan 09 '18 at 03:23
  • @PeterCordes: I'm not sure I still have books on, for example, the Pentium Pro to check for sure, but your comments certainly don't fit my recollection of the situation at all. There are certainly other CPUs (e.g., DEC Alpha) that define much looser coherency than x86. x86 CPUs have certainly been documented as using bus snooping for a *long* time (and all the behavior I've seen seems to fit with this). – Jerry Coffin Jan 09 '18 at 03:38
  • Actually you're right, there is snooping for multi-socket systems. I was forgetting. Dual-socket Intel Xeons just broadcast their L3 misses for the other socket to snoop, but chips capable of quad-socket and higher have snoop filters (small caches to keep track of what other sockets do / don't have cached on chip). This is basically an implementation-detail of MESIF, though, and still a different thing from DMA -> host transfers, because that's coming from outside the coherency domain. – Peter Cordes Jan 09 '18 at 04:02
  • Single-socket modern Intel CPUs don't need to snoop between cores because the shared L3 is inclusive, so checking L3 tells you whether a line is on-chip anywhere (even in Modified state in a private L1D somewhere when L3 is thus Invalid), and if so where. Except in Skylake-AVX512 where L3 isn't inclusive, but there's probably still a tag-inclusive structure somewhere that keeps track of what's cached on-chip. It would be too expensive (in power if nothing else) to broadcast transactions to all 28 cores for them to check their private caches separately. – Peter Cordes Jan 09 '18 at 04:06
  • @PeterCordes - well, the L3 just acts as a snoop filter for the cores, so _conceptually_ all the cores are snooping the traffic of other cores; it's just that the snoop doesn't need to go all the way to the core if the L3 indicates that the core can't possibly have the line. If the L3 indicates that the private caches of a core (may) have the line, then the "snoop" still has to go - maybe "snoop" is a confusing term now, given the certainty. I'm not actually sure if the L3 always has the up-to-date state: if a line is silently dropped from the L2, does it update its MESI state in L3 eagerly? – BeeOnRope Jan 09 '18 at 04:55
  • @BeeOnRope: Yes, thanks, that's a clearer / better way of saying it. I don't think L2 / L1D can drop a modified line without writing it back, though (the `invd` instruction affects all lines in all caches), so either write-back went through L3 (and it now has a valid copy of the line, so can respond to read requests from its own copy) or an L3 tag-match which says a core has a Modified copy of the line means it really does have a modified copy. But yes, L2 / L1 caches could get invalidate requests for lines they already dropped, from an RFO by another core. Good point: not a perfect filter. – Peter Cordes Jan 09 '18 at 05:03
  • Snooping is never (or least nearly never) actually done by the core proper--it's handled in the cache controller, mostly asynchronously from the core itself (and at least in some processors, continues in states where the cores themselves are completely shut down). A quick check of the Skylake-X datasheet seems to indicate that it still does snooping of traffic initiated externally by PCI-Express bus agents. – Jerry Coffin Jan 09 '18 at 05:07
  • @JerryCoffin - snoop/invalidate requests have to go, in some cases, all the way to the L1 (e.g., when a core has a modified line in its L1), so that part is most certainly handled by the core. @PeterCordes - yeah, I was talking about the case where the L2 silently dropped a not-modified line: unless there is an eager update to L3 to change the line's state, you'll get snoops for those lines even though they don't exist any more. – BeeOnRope Jan 09 '18 at 05:25
  • @BeeOnRope: Right, but only from RFOs / Invalidates, not from reads. For reads, if an inclusive L3 thinks an unmodified line exists somewhere, it has its own copy because it's inclusive. So that's a significant reduction in the amount of snooping for silently-evicted lines. (But Skylake-X's L3 isn't inclusive, so it presumably does just have to forward requests to cores that *might* have a copy). Hmm, or maybe cores do have to notify L3 when they drop a line, so L3 can favour dropping that line next time it needs to allocate a new line in that set. Fewer conflict misses. – Peter Cordes Jan 09 '18 at 05:33
  • @JerryCoffin: Yeah, the system agent has to snoop L3 on the way from PCIe to the memory controller to make DMA cache-coherent. One of the points I was making earlier was that DMA traffic is different from other cores (which participate in MESI). Device memory isn't coherent with system memory (although DMA reads/writes are these days on x86), but other CPUs are fully coherent. So it's confusing to lump them together. – Peter Cordes Jan 09 '18 at 05:37
  • Yeah some of the confusion arises from the fact that the OP mentions "I/O" in the title but the scenario as described is totally about core-to-core coherence and the IO doesn't come into it. – BeeOnRope Jan 09 '18 at 05:47