Cache coherency (particular case of a physically tagged cache)

Question

Imagine a process that has finished (it is no longer in memory) but, while it was running, used physical address 0x12345000 (4 KB pages). Now the MMU assigns 0x12345000 (physical) to another process that has just started. But maybe you still have in caché (physically tagged) the 0x12345 tag with the data of the previous process. This is a coherency problem. How is it solved?

EDIT: The assumption is: one process finishes and another process is loaded from disk into that same page of memory to run. My question is: what is done to prevent problems here? I understood that, before the second process was brought into memory, the page was zeroed. So now the caches hold zeros for that page, but the page itself holds the data of the second process. This is all I have understood, but it is probably wrong.

Peter Cordes's answer is perfect!

I don't see a problem here. The OS could just let the new process see whatever stale data the previous process left in that page, either via cache hits or by filling from RAM if the dirty data was written back. Or more likely in real OSes, the kernel will zero that physical page (via its virtual address) before giving it to a new process, to avoid leaking data across process (and maybe user) boundaries. On a multi-core system, coherency is maintained by MESI or some variant, so whichever core the new process runs on, it doesn't matter whether cache is hot or not. — Peter Cordes, Jun 14 '20 at 11:25
The problem that I see, Peter, is that the new process finds that the memory location it wants to access is already in caché (which results in a caché hit). But the data it is going to use from caché is not the correct data; it's the data of the previous process, which was mapped at that physical address and whose data still remained in cache. (I'm new to Stack Overflow and I don't know if this is the best way to respond to your comment. I hope it reaches you.) — isma, Jun 14 '20 at 11:41
You can notify people when you reply with @username. Sometimes Stack Overflow sends a notification anyway, like in this case, but better to be sure unless you're commenting under someone else's post (then they always get a notification). Also, "cache" doesn't have an accent on the e. The computing meaning comes from the noun https://www.merriam-webster.com/dictionary/cache meaning hidden supplies ("a cache of food"), and it's pronounced identically to "cash". Not [cachet / caché](https://www.merriam-webster.com/dictionary/cachet) meaning prestige. — Peter Cordes, Jun 14 '20 at 12:33
Back to your question: I think you're mixing up two concepts here: even in a system with no cache, the OS usually needs to zero a physical page (or load new data into it) before using it to back a virtual page in a different process. If it doesn't, the new process will see whatever stale data was left in it. Cache is irrelevant to this. A PIPT cache is totally transparent to the CPU (no aliasing problems); it caches based on physical address so it wouldn't care even if the OS mapped the same page to a different *virtual* address in another process. — Peter Cordes, Jun 14 '20 at 12:38
Also, "coherency" only applies between two caches for the same memory, e.g. in a multi-core CPU. https://en.wikipedia.org/wiki/Cache_coherence. Perhaps you're thinking of the cache aliasing problem? ([Definition/meaning of Aliasing? (CPU cache architectures)](https://stackoverflow.com/q/5947117)). A PIPT cache is immune to aliasing, and mapping the same page to the same virtual address also makes aliasing impossible with a VIPT or even VIVT cache. Any lines hot in cache will accurately reflect what's in that page of physical memory. — Peter Cordes, Jun 14 '20 at 12:44
@PeterCordes sorry, I wrote "caché" instead of cache because I'm used to writing "caché"; it's the correct word in Spanish (I'm from Spain). But back to my question: I think that I'm not explaining my doubt properly. I'll try to explain it better: — isma, Jun 14 '20 at 12:59
Suppose we have a process A which has been mapped at physical page number 0. We are using this page, so it gets cached (suppose the whole page is cached, and suppose we have a PIPT or a VIPT cache). Now this process ends, another process starts, and the OS assigns it physical page 0. Now our CPU looks up the translation in the page table and sees that the data it wants is, for example, at physical address 0x00000000. At this moment it looks in cache to see if it has the 0x0..0 tag and... it does! But the data remaining in cache is from the previous process! (the one which I named 'A'). — isma, Jun 14 '20 at 13:06
I mean that in cache memory we have the "residual" data of process A, and process B thinks that it is its own data because of the cache hit. I understand that the OS zeroes a physical page, but this is in main memory; I'm talking about the residual data in cache memory. — isma, Jun 14 '20 at 13:09
re: edit: what data was the load supposed to get, instead of 0? Where did that data originally come from? Are you once again imagining that something could have changed physical RAM without updating cache? (Like DMA of the new program's code / data from disk). Modern x86 systems have cache-coherent DMA. And regardless, the kernel wouldn't waste time zeroing a page that's about to be DMAed. For non-coherent DMA, you don't store zeros, you invalidate cache to avoid the problem of reading stale data from cache. — Peter Cordes, Jul 04 '20 at 23:18
Of course the new process's executable could already be in RAM, in which case the OS just has to set up the page tables of the new process to map those physical pages into its virtual address space, no disk I/O necessary. — Peter Cordes, Jul 04 '20 at 23:20
@PeterCordes the assumption is: one process finishes and another process is loaded from disk into that same page of memory to run. My question was: what is done to prevent problems here? I understood that, before the second process was brought into memory, the page was zeroed. So now the caches hold zeros for that page, but the page itself holds the data of the second process. This is all I have understood. I hope that with this comment you can see exactly what I misunderstood. — isma, Jul 04 '20 at 23:26
Add *that* to your question, that's finally enough detail to answer. You were assuming non-coherent DMA, and that the next process's code/data wasn't already in RAM. Modern x86-64 has cache-coherent DMA, and pages that are about to be read in from disk don't need to be zeroed. — Peter Cordes, Jul 04 '20 at 23:29
@PeterCordes added. And also I think that I understood the answer!! I wrote an "APROXIMATE ANSWER" in the question so you can read it and tell me if I'm right now (if you want, I'll delete that paragraph after you read it). In that paragraph you can find a little question in bold whose answer I'm not sure of. — isma, Jul 04 '20 at 23:50

Peter Cordes · Accepted Answer · 2020-07-05T00:31:46.723

> But the data remaining in cache is from the previous process

Yes, that's what's supposed to happen. The cache just keeps track of what's in physical memory. That is its only job. It doesn't know about processes.

If the OS doesn't want the new process to see that data, the kernel needs to run some instructions to store new data to that page, overwriting cache and memory contents.

Cache is transparent to this operation; it doesn't matter whether data is still hot in cache, or whether the old process's data has been written back to RAM by the time the kernel reuses that physical page.

(See also comments under the question for some more details).

> I understand that the OS zeroes a physical page, but this is in main memory; I'm talking about the residual data in cache memory.

I think this is the source of your confusion: this zeroing takes place with ordinary store instructions executed by the CPU. The OS runs on the CPU, and will zero a page by looping over the bytes (or words) storing zeros. Those stores are normal cacheable stores that are the same as any other write coming in at the top of the cache/memory hierarchy.
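To make that concrete, here is a toy model in Python (a sketch, not how any real CPU is implemented). The `ToyCache` class, the 64-byte line size and the write-back/write-allocate policy are all assumptions chosen for illustration. The point it tries to show: the cache is keyed only by physical address, so the kernel's zeroing stores land in the same lines that still hold process A's data, and process B's later cache hits return zeros.

```python
LINE = 64      # bytes per cache line (assumed; typical on x86)
PAGE = 4096    # 4 KB pages, as in the question


class ToyCache:
    """Hypothetical write-back, write-allocate cache keyed by physical address."""

    def __init__(self, ram):
        self.ram = ram        # bytearray standing in for DRAM
        self.lines = {}       # line base address -> bytearray(LINE)
        self.dirty = set()    # line base addresses not yet written back

    def _line(self, paddr):
        base = paddr - paddr % LINE
        if base not in self.lines:                      # miss: fill from RAM
            self.lines[base] = bytearray(self.ram[base:base + LINE])
        return base, self.lines[base]

    def load(self, paddr):
        base, line = self._line(paddr)
        return line[paddr - base]

    def store(self, paddr, value):
        base, line = self._line(paddr)
        line[paddr - base] = value                      # updates the cached copy
        self.dirty.add(base)


ram = bytearray(2 * PAGE)
cache = ToyCache(ram)
page = PAGE    # physical page that gets recycled

# Process A writes some data; the line stays hot (and dirty) in cache.
for i, b in enumerate(b"secret"):
    cache.store(page + i, b)

# Process A exits. The kernel zeroes the page with ordinary stores,
# which go through the very same cache.
for addr in range(page, page + PAGE):
    cache.store(addr, 0)

# Process B gets the page and reads it: cache hits, and the data is zeros.
seen_by_b = bytes(cache.load(page + i) for i in range(6))
```

Nothing in the model knows which process issued a store; the only key is the physical address, which is why there is no stale copy left to hit on.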

If the OS wanted to offload the zeroing to a DMA engine or blitter chip that wasn't cache-coherent, then yes the OS would have to invalidate any cache lines in that page first to avoid the problem you're talking about, losing coherence with RAM. But that's not the normal case.

And BTW, "normal stores" can still be pretty fast: modern x86 CPUs can store 32 or 64 bytes per clock cycle with SIMD instructions, or with `rep stosb`, which is basically a microcoded `memset` that can internally use wide stores. AMD even has a `clzero` instruction to zero a full cache line. But these are all still CPU instructions whose view of memory goes through cache.

Loading new code/data for a new process

Modern x86-64 systems have cache-coherent DMA, making this a non-problem. This is easy in modern x86-64 when the memory controllers are built into the CPU, so PCIe traffic can check L3 cache on the way past. It doesn't matter what cache lines were still hot in cache from a previous process; DMA into that page evicts those lines from cache. (Or with non-DMA "programmed IO", the data is actually loaded into registers by driver code running on a CPU core, and stored into memory with normal stores, which again are cache-coherent).

https://en.wikipedia.org/wiki/Direct_memory_access#Cache_coherency
Some Xeon systems can even DMA into L3 cache, avoiding main-memory latency/bandwidth bottlenecks (e.g. for multi-gigabit networking) and saving power. https://en.wikipedia.org/wiki/Direct_memory_access#DDIO
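As a rough illustration of what "cache-coherent DMA" buys you, here is a small Python model. The `CoherentSystem` class and its method names are hypothetical, and the cache is read-only to keep it short (real hardware also has to deal with dirty lines). The only thing it tries to show is that the device's write path evicts matching lines, so a later CPU load cannot hit stale data.

```python
LINE = 64    # bytes per cache line (assumed)


class CoherentSystem:
    """Hypothetical model: device writes pass a snoop step that evicts cached lines."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.cache = {}    # line base address -> bytes snapshot of that line

    def cpu_load(self, paddr):
        base = paddr - paddr % LINE
        if base not in self.cache:                      # miss: fill from RAM
            self.cache[base] = bytes(self.ram[base:base + LINE])
        return self.cache[base][paddr - base]

    def dma_write(self, paddr, data):
        # The hardware snoops the cache on the way to memory and evicts
        # every line that the device write overlaps.
        first = paddr - paddr % LINE
        for base in range(first, paddr + len(data), LINE):
            self.cache.pop(base, None)
        self.ram[paddr:paddr + len(data)] = data


system = CoherentSystem(4096)
system.ram[0:8] = b"process1"
before = bytes(system.cpu_load(i) for i in range(8))    # line is now hot

system.dma_write(0, b"process2")                        # e.g. a disk read into the page
after = bytes(system.cpu_load(i) for i in range(8))     # miss, refilled from RAM
```

The OS does not run any invalidation code here; in this sketch the eviction is part of the write path itself, which is the sense in which the hardware "does it".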

Older systems without cache-coherent DMA do have to be careful to avoid stale cache hits when data in DRAM changes. This is a real problem, and it's not limited to starting a new process. Reusing a just-freed (munmapped) page for a new mmap of a different file has to worry about it. Any disk I/O has to worry about this, including writing to disk: you need to get data from cache synced to DRAM where it can be DMAed to disk.

This might require looping over a page and running an instruction like `clflush`, or the equivalent on other ISAs. (I don't know what OSes did on x86 CPUs that predate `clflush`, if there were ever any that weren't cache-coherent.) You might find something about it in the Linux kernel's doc directory.

This LWN article, "DMA, small buffers, and cache incoherence" from 2002, might be relevant. At that point, x86 was already said to have cache-coherent DMA, so maybe x86 has always had this. Before SSE, I don't know how x86 could reliably invalidate cache except for `wbinvd`, which is extremely slow and system-wide (invalidating all cache lines, not just one page), so not really usable for performance reasons.
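A toy Python model of the non-coherent case might look like the following. `ToyCache`, `dma_write` and `invalidate_range` are hypothetical names, and the cache is read-only, so the model only covers the "stale read after a device write" direction, not the write-back that a real `clflush` also performs on dirty lines.

```python
LINE = 64    # bytes per cache line (assumed)


class ToyCache:
    """Hypothetical cache in front of a bytearray 'RAM'; it never sees device writes."""

    def __init__(self, ram):
        self.ram = ram
        self.lines = {}    # line base address -> bytes snapshot of that line

    def load(self, paddr):
        base = paddr - paddr % LINE
        if base not in self.lines:                      # miss: fill from RAM
            self.lines[base] = bytes(self.ram[base:base + LINE])
        return self.lines[base][paddr - base]

    def invalidate_range(self, start, length):
        """Drop every line overlapping [start, start + length), one line at a time."""
        first = start - start % LINE
        for base in range(first, start + length, LINE):
            self.lines.pop(base, None)


def dma_write(ram, paddr, data):
    """Device writes straight to RAM, bypassing the cache (non-coherent DMA)."""
    ram[paddr:paddr + len(data)] = data


ram = bytearray(4096)
ram[0:3] = b"old"
cache = ToyCache(ram)
cache.load(0)                                        # line is now hot, holding "old"

dma_write(ram, 0, b"new")                            # RAM changes behind the cache
stale = bytes(cache.load(i) for i in range(3))       # cache hit on stale data

cache.invalidate_range(0, 4096)                      # what the OS has to do by hand
fresh = bytes(cache.load(i) for i in range(3))       # miss, refilled from RAM
```

In this sketch the stale hit only goes away because software explicitly invalidated the page, which is the extra work a non-coherent system pushes onto the OS.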

Either way (coherent or not), an OS wouldn't waste time storing zeros to pages it was about to read from disk. Zeroing is done for a new process's BSS, and any pages it allocates with mmap(MAP_ANONYMOUS), not for its code/data sections.
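You can observe the zeroing of anonymous memory from user space. A minimal sketch using Python's standard `mmap` module: passing `-1` as the file descriptor asks for an anonymous mapping. (Python's default flags on Unix are `MAP_SHARED` rather than `MAP_PRIVATE`, so this is not exactly the `mmap(MAP_ANONYMOUS)` call a C program would typically make, but fresh anonymous pages read as zeros either way.)

```python
import mmap

PAGE = mmap.PAGESIZE

# Anonymous mapping: not backed by any file, so the kernel must hand out
# pages that read as zeros, whatever those physical pages held before.
buf = mmap.mmap(-1, PAGE)

first_read = buf[:PAGE]     # contents before this process ever wrote to the page
buf[:5] = b"hello"
after_write = buf[:5]       # ordinary stores, visible to later loads
buf.close()
```

Whether the kernel zeroed the page ahead of time or at the first page fault is not visible from here; the process just sees zeros.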

Also, the executable you're executing as a new process could already be in RAM, in which case you just have to set up the new process's page tables.

I was reading your answer again and there is one thing I don't understand: if you write '0' in that data, is the validity bit still on? If yes: The following process with that physical address would make a false hit and read all zeros (which is valid data too). If not: In caches L2 and L3 there would still be the data of the previous process that occupied that position of physical memory, without having been zeroed. So in either case it would not be working. What am I missing? — isma, Jul 04 '20 at 17:42
@isma: With physically tagged cache, there are no false hits, ever. The CPU always reads whatever the last thing written to memory was, *because* cache is coherent. The CPU doesn't care whether the store was done by the kernel or by user-space. Other cores also can't read stale data from an outer level of cache; MESI coherency guarantees that when stores commit to L1d cache, this core has exclusive ownership of the line, with outer caches invalid or tracking that ownership. TL:DR: cache is coherent. After you write a zero to a physical address, it reads as zero. — Peter Cordes, Jul 04 '20 at 22:15
@isma: but yes, writing zeros isn't special. The next process can certainly get cache hits in the recently-zeroed pages. The whole point of physically-tagged (not virtually-tagged) caches is that it doesn't have to be invalidated on context switches, including ending one process (freeing its pages) and starting another one (recycling those pages). IDK if you were thinking about storing `0` over data that was already zero? [What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?](//stackoverflow.com/q/47417481) - silent store isn't done. — Peter Cordes, Jul 04 '20 at 22:18
Peter, what I don't understand is: the kernel zeroes a page and the MESI coherency protocol invalidates all the copies that are in other L1d caches and in the L2 and L3 caches. But in the L1d where the zeroing happened, that line is still valid, right? So, in a later read, if a program wants some data located at a physical address that corresponds to that tag, it will be a hit and it will find all zeros in the data. And this is not correct! So maybe there is something about the MESI protocol or about zeroing that I still don't understand. — isma, Jul 04 '20 at 22:31
@isma: Any location that shouldn't read as zeros needs to get written by the program-loader, i.e. loading the new program's code and data from disk instead of zeroing pages. Any pages that *should* read as zeros (the new process's BSS) *should* get zeroed by the kernel. The cache doesn't know or care about separate processes, it just makes sure that reads see whatever was most recently written. It's up to the OS to make sure that the most recently written data was correct. This isn't a caching issue, it's just an OS correctness issue that would apply without cache. — Peter Cordes, Jul 04 '20 at 22:35
Hi Peter, I commented yesterday on my original question but I think that you didn't see the comment. I have read your edited answer and, because English is not my native language and because I want to be sure of what you said, I rephrased your answer in my own words in the paragraph that I added to the original question. Could you read the added paragraph that starts with "APROXIMATE ANSWER" and tell me if what I wrote there is correct, please? Also in that paragraph you will find a little question in bold (that question only needs a one- or two-word answer). Thanks in advance!! — isma, Jul 05 '20 at 13:13
@isma: Yes, I did read that yesterday; I didn't comment because you'd already changed the phrasing to mention my updated answer which says the same thing. Anyway yes, looks like you've understood correctly and can delete that part now. BTW, there doesn't have to be a "DMA engine" in modern PCI systems; PCI devices can bus-master themselves. As far as cache coherency is concerned, it doesn't matter what device is storing to memory (a "dma engine" or a PCI card itself); either way the system agent sees it as traffic from a device to memory and snoops or invalidates L3 on the way. — Peter Cordes, Jul 05 '20 at 13:21
Of course, I'll delete it right now. Again, thank you so much! Just before I delete it: could you tell me in a couple of words who is in charge of invalidating the cache entries in the case we were discussing: the OS or the DMA? (think of modern systems) — isma, Jul 05 '20 at 13:24
@isma: See my edit to my previous comment. Hardware does it, that's what having cache-coherent DMA means (see my answer). On modern Intel, the "System Agent" is one of the stops on the ring bus and connects the ring bus to off-chip PCIe devices. There isn't a "DMA engine", DMA is just something that any PCI / PCIe device can do. — Peter Cordes, Jul 05 '20 at 13:27

score 1 · Answer 2 · answered Jun 18 '20 at 17:29

When the first process terminates, all of its physical memory pages are "freed" by the OS. In almost all cases, the kernel zeros the contents of these newly freed pages (this invalidates any cached copies of those physical addresses anywhere in the system) and "shoots down" the corresponding TLB entries (so no TLB retains a mapping from the previous virtual address to the physical address). Only after each TLB entry has been "shot down" and each page has been zeroed, can the kernel add that page to a "free list", at which point it becomes eligible for re-use.

There are many variations on this pattern, depending on the capabilities of the hardware and the preferences of the OS developers. I seem to recall that in the SGI IRIX operating system for MIPS processors, the TLB shootdown was done implicitly. The MIPS hardware had the capability of invalidating a TLB entry based on its number (rather than its contents). The OS would shoot down one TLB entry every 10 milliseconds, then increment the pointer for the next interval. After 32 (or 64?) of these 10 millisecond intervals, you were guaranteed that all TLB entries in the system had been flushed -- so any page freed more than 1 second ago was guaranteed to have no stale TLB entries, and could be re-used (after zeroing, of course). This seems like a reasonable approach for a scalable shared-memory system like the SGI Origin 2000.
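The arithmetic behind that scheme can be sketched in a few lines of Python. The entry count (64), the starting index and the function name are assumptions for illustration; the answer above only recalls "32 (or 64?)" entries and a 10 millisecond interval.

```python
TLB_ENTRIES = 64     # assumed; the answer recalls 32 or 64
INTERVAL_MS = 10     # one entry is invalidated per interval


def intervals_until_all_flushed(n_entries, start_index=0):
    """Invalidate one TLB slot per tick, round-robin, until every slot has been hit."""
    flushed = [False] * n_entries
    index = start_index
    ticks = 0
    while not all(flushed):
        flushed[index] = True                  # invalidate by slot number, not contents
        index = (index + 1) % n_entries        # advance the pointer for the next interval
        ticks += 1
    return ticks


# Wherever the pointer happens to be when a page is freed, one full lap
# covers every slot.
ticks = intervals_until_all_flushed(TLB_ENTRIES, start_index=17)
worst_case_ms = ticks * INTERVAL_MS
```

With these assumed numbers a full lap takes 640 ms, which is consistent with the "more than 1 second ago" rule of thumb leaving some margin.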

I think modern Linux does lazy / last-minute zeroing, so the fresh page will be hot in L1d cache for the process that just soft-pagefaulted when writing it for the first time. And so there's no need to track separate lists of free clean and free dirty pages. Also, if the first use is write-only by the kernel (e.g. as a DMA target for a disk read), there's no benefit to having zeroed it earlier. Interesting point about TLB, though, for pages of a multi-threaded process that might have been running on multiple cores. — Peter Cordes, Jun 18 '20 at 17:38
