
Taking Arm as an example, it has hardware support for automatic cache invalidation, as explained here: https://developer.arm.com/documentation/den0024/a/Multi-core-processors/Multi-core-cache-coherency-within-a-cluster

It also has software instructions to do the same manually, such as `DC` and `SYS`.
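For context, in the A64 ISA `DC` and `IC` are aliases of the generic `SYS` instruction. A minimal sketch of issuing them from C, assuming AArch64 with a GCC/Clang toolchain (the function and parameter names are illustrative, not from any particular API):

```c
/* Minimal sketch, AArch64 + GCC/Clang inline asm assumed. */

static inline void clean_dcache_line(void *addr)
{
    /* DC CVAC: clean (write back) the D-cache line holding addr
     * to the Point of Coherency. */
    __asm__ volatile("dc cvac, %0" : : "r"(addr) : "memory");
}

static inline void invalidate_icache_line(void *addr)
{
    /* IC IVAU: invalidate the I-cache line holding addr
     * to the Point of Unification. */
    __asm__ volatile("ic ivau, %0" : : "r"(addr) : "memory");
}
```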

My question is: why and when would you ever need to run such instructions if it's already covered automatically by the hardware?

This question applies to any other architecture which supports both SW and HW cache invalidation.

  • Instruction caches aren't coherent on ARM, like on most non-x86 ISAs, so that's one major reason. Also I guess performance micro-optimization or other shenanigans on embedded systems. x86 for a long time didn't even have `clflush`, although it did have `wbinvd` and `invd` to invalidate all caches system-wide (with / without write-back first; `invd` is usually only used when [leaving cache-as-RAM mode after configuring DRAM controllers in early boot firmware code](https://stackoverflow.com/q/41775371/224132)); x86 has always had cache-coherent DMA, since the first x86 CPUs with cache at all. – Peter Cordes Feb 23 '22 at 03:39
  • If some ARM systems don't have cache-coherent DMA, that would be a major reason for such instructions to exist. – Peter Cordes Feb 23 '22 at 03:40
  • "*Instruction caches aren't coherent on ARM*", is this related to the weak memory model of Arm? I thought cache coherence and the memory mode are 2 separate things. – Dan Feb 23 '22 at 03:43
  • No, it's not really related, except in terms of historical reasons for *why* x86 is that way: so x86 CPUs with instruction caches could be introduced without requiring software to be aware of it, still following the same rules as earlier CPUs for self-modifying code and JITs. That's also basically the reason for x86's strongly-ordered memory model: no new memory reordering has been introduced since 486-era SMP machines, still just program order plus a store buffer with store-forwarding. (Apparently 386 SMP systems were sequentially consistent, but that's incompatible with pipelined execution.) – Peter Cordes Feb 23 '22 at 03:55
  • Makes sense. I guess having software instructions also helps with I/O peripherals on the bus which don't go through memory, or are memory-mapped. In that case they may not even fall under the watch of the coherency hardware, and the cache could be out of date unless cleared manually? – Dan Feb 23 '22 at 03:58
  • x86's reason for commercial success was/is backward compatibility with existing software. x86 CPU vendors bend over backwards to not break mainstream commercial software, e.g. making behaviour stronger than what they guarantee on paper. (e.g. see Andy Glew's answer and comment thread on [Observing stale instruction fetching on x86 with self-modifying code](https://stackoverflow.com/a/18388700) - he was one of the main architects of Intel's P6 (Pentium Pro) microarchitecture) – Peter Cordes Feb 23 '22 at 03:59
  • Re: I/O peripherals: almost always, ranges of physical memory containing MMIO I/O-register regions should be marked as uncacheable, so CPU cache-control instructions are irrelevant. Device-memory (such as video RAM) can be different; e.g. on x86 you'd mark that region as uncacheable but with software write-combining allowed. (WC memory type). You could imagine maybe doing something different if the CPU wouldn't choke on it, if you had cheapish instructions to invalidate a line before a read like ARM does. – Peter Cordes Feb 23 '22 at 04:03
  • But more likely relevant for device writes into host DRAM where you want that DRAM to be write-back cacheable, i.e. DMA from a disk controller when loading data (possibly including code) from a file, or other device bus-master cases like a network controller. If the system doesn't have cache-coherent DMA, the kernel has to make sure there aren't stale cache entries that won't match data written to the underlying DRAM. And similarly, the CPU has to make sure cache is written back to DRAM before letting a device read DRAM. (A sketch of this pattern appears after this thread.) – Peter Cordes Feb 23 '22 at 04:05
  • I don't know ARM well enough to say if that's the main motivation, or how its "inner shareable" vs. other coherency domains work, otherwise I'd post this as an answer :/ – Peter Cordes Feb 23 '22 at 04:06
  • Not a problem. When you say "*Instruction caches aren't coherent on ARM*", do you mean if a program tries to hot patch its own instructions then they may not be written back to memory? – Dan Feb 23 '22 at 04:07
  • That's true but irrelevant. Main memory doesn't matter, just the "point of unification", i.e. some outer level of non-split cache. Store instructions will invalidate other *data* and *unified* caches, but not necessarily instruction caches. So the old stale machine-code bytes can still be fetched by this or another CPU, unless you make sure your L1d is evicted back to the point of unification, and that L1i is invalidated. Then instruction-fetch will pull the newly-stored data from L2 cache or whatever. (From main memory if it got evicted all the way there, e.g. split L1i/d and no L2) – Peter Cordes Feb 23 '22 at 04:12
  • Thank you but how is it irrelevant? Why else would a process try to write instructions back to its own memory? I thought an easy example would be hot patching itself. – Dan Feb 23 '22 at 04:16
  • It doesn't matter whether a store makes it all the way to main memory, just to some level of cache that instruction-fetch pulls through on an L1i miss. And if L1i is still valid with stale data, having the updated data in main memory doesn't help you at all. (Remember, L1i is *not* coherent, so it won't get invalidated automatically during write-back from L1d to L2 to main memory.) And then this gets extra complicated in a multi-core system when you want to write machine code that can be executed by another thread, e.g. a JVM being a typical example. – Peter Cordes Feb 23 '22 at 04:22
  • Makes sense, but I'm just looking for a real-world example where I would have to use a software instruction to clear cache and the hardware won't do it for me. Could code hot-patching itself be an example? – Dan Feb 23 '22 at 04:24
  • 1
    I don't know ARM well enough to write one off the top of my head, but you could google ARM self-modifying code / JIT recipes. It will involve some cache-control instructions to make sure L1d is written back (to some point of unification) and L1i is invalidated, and a memory barrier to order those things. [Synchronizing caches for JIT/self-modifying code on ARM](https://stackoverflow.com/q/70635862) discusses one block of example code, with lots of detail for ARM64 specifically about what each instruction does. – Peter Cordes Feb 23 '22 at 04:26
  • 1
    See also [How to synchronize on ARM when one thread is writing code which the other thread may be executing concurrently?](https://stackoverflow.com/q/39295261) – Peter Cordes Feb 23 '22 at 04:34
  • @PeterCordes something interesting I learned from the link above is that hot-patched instructions will end up in D-cache, not I-cache. So if I write instructions into memory (like jitted code) they will go through the data cache, even though they are technically not data but instructions, and the I-cache is read-only. – Dan Feb 23 '22 at 15:12
  • 1
    Right, of course; my [earlier comments](https://stackoverflow.com/questions/71229898/why-do-we-need-both-hardware-support-and-software-instructions-for-invalidating?noredirect=1#comment125909447_71229898) were all based on that fact. ARM is not a full Harvard ISA, there aren't special load/store instructions to load/store program memory as opposed to data, so yes, all load/store instructions are treated as data loads/stores, going through L1d cache. Data you load/store isn't truly instructions until the CPU actually fetches it via the code-fetch path, not data load/store. – Peter Cordes Feb 23 '22 at 22:28
  • Because sometimes the hardware does it wrong. Or at least, not in the way you'd like. – user253751 Mar 21 '22 at 17:54
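For the JIT/self-modifying-code case discussed above, here is a hedged sketch of the kind of sequence the linked answers describe, assuming AArch64, a GCC/Clang toolchain, and a fixed 64-byte cache line (production code should read `CTR_EL0` for the real D- and I-cache line sizes); `sync_icache` is an illustrative name, not a standard API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hedged sketch of the AArch64 cross-modifying-code recipe:
 * clean L1d to the Point of Unification, then invalidate L1i,
 * with barriers ordering the two phases. Assumes 64-byte lines. */
static void sync_icache(void *start, size_t len)
{
    uintptr_t first = (uintptr_t)start & ~(uintptr_t)63;
    uintptr_t end   = (uintptr_t)start + len;

    /* Clean each D-cache line to the Point of Unification. */
    for (uintptr_t a = first; a < end; a += 64)
        __asm__ volatile("dc cvau, %0" : : "r"(a) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");  /* wait for the cleans */

    /* Invalidate the corresponding I-cache lines. */
    for (uintptr_t a = first; a < end; a += 64)
        __asm__ volatile("ic ivau, %0" : : "r"(a) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");  /* wait for the invalidates */
    __asm__ volatile("isb"     ::: "memory");  /* re-fetch on this core */
}
```

After writing jitted code into a buffer you would call `sync_icache(buf, len)` before jumping to it; another core that executes the new code still needs its own `ISB` (context synchronization) first. In practice, GCC and Clang expose essentially this sequence as the `__builtin___clear_cache` builtin, which takes begin and end pointers.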
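For the non-coherent-DMA case, the cache-maintenance instructions usually live inside the OS rather than in application code. A hedged sketch using the Linux streaming-DMA API (`dev`, `buf`, and `len` are hypothetical); on an Arm system without cache-coherent DMA, these mapping calls are where the kernel issues the `DC` clean/invalidate operations on your behalf:

```c
#include <linux/dma-mapping.h>

/* Hedged sketch: dev, buf, and len are hypothetical. On a system
 * without cache-coherent DMA, the streaming-DMA mapping layer runs
 * the cache-maintenance instructions so drivers don't have to. */
static void receive_from_device(struct device *dev, void *buf, size_t len)
{
    /* Map for device->memory: stale CPU cache lines covering buf are
     * invalidated so the device's writes to DRAM won't be shadowed
     * by old cached data. */
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, handle))
        return;

    /* ... program the device to DMA into 'handle', wait for completion ... */

    /* Unmap: on non-coherent systems this invalidates again so the CPU
     * reads the device's data from DRAM, not from stale cache. */
    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}
```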

0 Answers