But the data remaining in cache is from the previous process
Yes, that's what's supposed to happen. The cache just keeps track of what's in physical memory. That is its only job. It doesn't know about processes.
If the OS doesn't want the new process to see that data, the kernel needs to run some instructions to store new data to that page, overwriting cache and memory contents.
Cache is transparent to this operation; it doesn't matter whether data is still hot in cache, or whether the old process's data has been written back to RAM by the time the kernel reuses that physical page.
(See also comments under the question for some more details).
I understand that the OS zero a physical page but this is in main memory, but I'm talking about the residual data in cache memory.
I think this is the source of your confusion: this zeroing takes place with ordinary store instructions executed by the CPU. The OS runs on the CPU, and will zero a page by looping over the bytes (or words) storing zeros. Those stores are normal cacheable stores that are the same as any other write coming in at the top of the cache/memory hierarchy.
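As a rough user-space sketch of what "zeroing a page with ordinary stores" means (not actual kernel code; the 4 KiB PAGE_SIZE and the names zero_page and page_is_zero are assumptions made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumption: 4 KiB pages, as on typical x86-64 */

/* Zero one page with plain 64-bit stores. Each store is an ordinary
 * cacheable write: whether the line is still hot in cache or has to be
 * fetched first, the old contents end up overwritten. */
static void zero_page(void *page)
{
    assert(((uintptr_t)page % sizeof(uint64_t)) == 0);
    volatile uint64_t *p = page;  /* volatile: keep the loop as real stores */
    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
        p[i] = 0;
}

/* Hypothetical checker: returns 1 if every byte of the page is zero. */
static int page_is_zero(const void *page)
{
    const unsigned char *b = page;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (b[i] != 0)
            return 0;
    return 1;
}
```

Nothing in the loop mentions the cache at all, which is the point: coherence is handled by the hardware underneath ordinary loads and stores.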
If the OS wanted to offload the zeroing to a DMA engine or blitter chip that wasn't cache-coherent, then yes the OS would have to invalidate any cache lines in that page first to avoid the problem you're talking about, losing coherence with RAM. But that's not the normal case.
And BTW, "normal store" can still be pretty fast. e.g. modern x86 CPUs can store 32 or 64 bytes per clock cycle with SIMD instructions, or with rep stosb, which is basically a microcoded memset that can internally use wide stores. AMD even has a clzero instruction to zero a full cache line. But these are all still CPU instructions whose view of memory goes through cache.
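For illustration, here is roughly how rep stosb can be issued from C with GCC/Clang inline asm. This is a sketch under assumptions: zero_bytes is a made-up name, the inline-asm path is only compiled on x86-64, and other targets fall back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill n bytes at dst with zero. On x86-64 with GCC/Clang this issues a
 * single `rep stosb`; on other targets it falls back to memset. */
static void zero_bytes(void *dst, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* rep stosb: store AL to [RDI], RCX times, advancing RDI each time.
     * The ABI guarantees the direction flag is clear on function entry. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(0)
                     : "memory");
#else
    memset(dst, 0, n);
#endif
}
```

Either way the stores go through the cache hierarchy like any others; how wide they are internally is up to the microcode.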
Loading new code/data for a new process
Modern x86-64 systems have cache-coherent DMA, making this a non-problem. This is easy on modern x86-64, where the memory controllers are built into the CPU, so PCIe traffic can check L3 cache on the way past. It doesn't matter what cache lines were still hot in cache from a previous process; DMA into that page evicts those lines from cache. (Or with non-DMA "programmed IO", the data is actually loaded into registers by driver code running on a CPU core, and stored into memory with normal stores, which again are cache-coherent.)
https://en.wikipedia.org/wiki/Direct_memory_access#Cache_coherency
Some Xeon systems can even DMA into L3 cache, avoiding main-memory latency/bandwidth bottlenecks (e.g. for multi-gigabit networking) and saving power. https://en.wikipedia.org/wiki/Direct_memory_access#DDIO
Older systems without cache-coherent DMA do have to be careful to avoid stale cache hits when data in DRAM changes. This is a real problem, and it's not limited to starting a new process. Reusing a just-freed (munmapped) page for a new mmap of a different file has to worry about it. Any disk I/O has to worry about this, including writing to disk: you need to get data from cache synced to DRAM where it can be DMAed to disk.
This might require looping over a page and running an instruction like clflush, or the equivalent on other ISAs. (I don't know what OSes did on x86 CPUs that predate clflush, if there were ever any that weren't cache-coherent.) You might find something about it in the Linux kernel's doc directory.
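A minimal sketch of such a flush loop, using the _mm_clflush and _mm_mfence intrinsics from emmintrin.h. Assumptions: flush_range is a hypothetical name, the line size is hard-coded to 64 bytes (real code should query it, e.g. via CPUID), and on non-x86-64 targets the loop only counts lines without flushing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#endif

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Write back and invalidate every cache line overlapping
 * [addr, addr + len). Returns the number of lines visited. */
static size_t flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;
    for (uintptr_t p = start; p < end; p += CACHE_LINE) {
#if defined(__x86_64__)
        _mm_clflush((const void *)p);  /* flush the line containing p */
#endif
        lines++;
    }
#if defined(__x86_64__)
    _mm_mfence();  /* order the flushes before whatever starts the DMA */
#endif
    return lines;
}
```

For a 4 KiB page that is 64 flushes, which is why this per-line approach is usable where a whole-cache flush would not be.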
This LWN article: DMA, small buffers, and cache incoherence from 2002 might be relevant. At that point, x86 was already said to have cache-coherent DMA, so maybe x86 has always had this. Before clflush (which arrived alongside SSE2), I don't know how x86 could reliably invalidate cache except for wbinvd, which is extremely slow and system-wide (writing back and invalidating all cache lines, not just one page), so not really usable for performance reasons.
Either way (coherent or not), an OS wouldn't waste time storing zeros to pages it was about to read from disk. Zeroing is done for a new process's BSS, and any pages it allocates with mmap(MAP_ANONYMOUS), not for its code/data sections.
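You can see the zeroing guarantee from user space. This sketch (POSIX/Linux assumed; count_nonzero_in_fresh_mapping is a made-up name) maps fresh anonymous memory and counts any nonzero bytes, which should always come out as zero regardless of what previously lived in those physical pages or in cache:

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of fresh anonymous memory and count nonzero bytes.
 * Returns -1 if mmap fails. */
static long count_nonzero_in_fresh_mapping(size_t len)
{
    assert(len > 0);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    long nonzero = 0;
    for (size_t i = 0; i < len; i++)  /* first touch faults in zeroed pages */
        if (p[i] != 0)
            nonzero++;
    munmap(p, len);
    return nonzero;
}
```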
Also, the executable you're executing as a new process could already be in RAM, in which case you just have to set up the new process's page tables.