
As a follow-up to this question, I'd like to know which real-world CPUs, if any, have instructions for explicitly flushing the cache/writing back to main memory. And under what circumstances are these instructions used?

Do compilers on these platforms need to emit these instructions for volatile writes (i.e. writes that have visibility guarantees in some languages)?

One example I'm vaguely aware of is IBM's Cell processor used in the Playstation 3, which is said to have had incoherent caches which had to be manually flushed by the software.

Peter Cordes
Malt
  • What part of the documentation for the tagged architectures do you not understand? volatile is vague in this question, and how would that have anything to do with caches? There is no magic in your other question either. – old_timer Jul 07 '23 at 06:52
  • It would be silly to think the compilers have anything to do with it; even if they know the target instruction set, they can't know the system architecture, perhaps on x86 but very much not on ARM. In your prior question, volatile simply causes the writes to not get optimized out and to happen in that order, so that when finished is asserted, result has completed. No cache magic, etc., just compile the stores in order. You can just look at the code the compiler generates to answer that question and this one on your own. – old_timer Jul 07 '23 at 06:59
  • If a language like C# in your example essentially dictates that functionality, that indicates to the compiler folks that they cannot have threads across cores (with separate caches per core). The compiler does not and cannot have the detailed info needed otherwise. Now the compiler may use instructions that work the pipe such that not only is the write explicit in the code (vs. being optimized out, or done later), but some sort of fence/block/whatever happens to force the buffers to complete (write buffer...) before the next instruction can continue. I doubt seriously you will see that though. – old_timer Jul 07 '23 at 07:04
  • Here or in the prior question, please post complete examples including disassembly of the compiler-generated code, so that there is something to write an answer about. – old_timer Jul 07 '23 at 07:05
  • Your title and question body are different questions. Cache-coherency protocols like MESI ensure visibility even when a line is in Modified state in the private L1d cache of one core. That core doesn't have to share it or write it back until requested. Cache-control instructions like x86 `clwb` or ARM https://developer.arm.com/documentation/den0013/d/Caches/Invalidating-and-cleaning-cache-memory (I thought they had instructions for writing back a single line; that example from the manual shows write-back and disabling the whole cache) can and do exist for reasons other than visibility. – Peter Cordes Jul 07 '23 at 13:20
  • I edited the question, tags, and title. Hope that my intention is clearer now. – Malt Jul 08 '23 at 01:17
  • 1
    The majority of CPUs with cache-flush instructions don't need them for visibility between threads in normal situations. That's fairly rare; ccNUMA is nearly universal for SMP machines. Heterogeneous architectures like Cell, and some ARM boards with a DSP + microcontroller, exist with shared but non-cache-coherent memory. Oh also, I think quite a few architectures have DMA that isn't cache-coherent, so they need to flush before starting DMA. – Peter Cordes Jul 08 '23 at 02:30

1 Answer


ARMv8 has an elaborate set of dc and ic instructions to manage the data and instruction cache respectively.

Their exact description is more complicated than just "flush L2 data cache" etc, because ARMv8 has an abstract cache model that's based instead on "points" where the caches might diverge from each other. Cache lines can be invalidated or written back ("cleaned") either by virtual address or by explicit level/set/way address. So you have instructions like dc civac, Data Cache Clean and Invalidate by Virtual Address to Point of Coherency.
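As a hedged sketch (not from the answer itself), here is how one of those instructions can be issued from C via inline assembly. On AArch64 Linux, user space is allowed to execute `dc civac` because the kernel sets SCTLR_EL1.UCI; the guard keeps the example compilable on other architectures.

```c
#include <stddef.h>

/* Sketch: issue "dc civac" (Data Cache Clean and Invalidate by Virtual
 * Address to the Point of Coherency) on the line containing `p`.
 * Returns 1 if the instruction was actually issued, 0 on non-AArch64
 * builds where this operation is not expressible. */
int clean_invalidate_line(void *p) {
#if defined(__aarch64__)
    __asm__ volatile("dc civac, %0" :: "r"(p) : "memory");
    return 1;
#else
    (void)p;  /* no portable equivalent; coherency handled by hardware */
    return 0;
#endif
}
```

Note that clean+invalidate only writes the line back and drops it from the cache; the data in memory is unchanged, so this is safe to run on ordinary variables.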

The full definition of the cache model takes about 30 pages in the ARMv8 Architecture Reference Manual (Section D7.4), and about another 60 pages to describe the actual cache maintenance instructions (C5.3).

Most of this is irrelevant to the application programmer, and indeed, many forms of the ic/dc instructions are privileged and cannot be executed by an application anyway. The ARMv8 memory model guarantees that data caches are coherent across the "inner shareable domain", which includes all cores that might be running threads of your program, so no explicit cache management instructions are needed for sharing variables with other threads. (You still need memory barriers like ldar/stlr/dmb when you need to ensure that loads and stores commit to the coherent cache in a particular order, as for acquire, release or sequential consistency.)
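A minimal sketch of that last point, using C11 atomics (the function names here are illustrative, not from the answer): on AArch64 the release store compiles to stlr (or dmb + str) and the acquire load to ldar; the compiler emits ordering instructions but no cache maintenance, because coherency already guarantees visibility.

```c
#include <stdatomic.h>

int payload;        /* ordinary data */
_Atomic int ready;  /* publication flag */

/* Release store: `payload` must be visible (via normal cache
 * coherency) before any thread can observe ready == 1. */
void publisher(void) {
    payload = 123;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Acquire load: if it observes 1, reading `payload` is safe. */
int consumer(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return payload;
    return -1;  /* not published yet */
}
```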

One aspect that can affect application programming is that, unlike on x86, the instruction and data caches are not unified nor coherent with each other. Therefore, when you write data that is later going to be executed as instructions, such as when loading a binary or JIT compiling, you do need to explicitly clean the relevant lines from the data cache to the "point of unification", then invalidate them from the instruction cache, and finally execute a synchronization barrier (isb) to flush any instructions already prefetched from cache. See Synchronizing caches for JIT/self-modifying code on ARM for more details.
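In portable C, that clean/invalidate/isb sequence is wrapped by the GCC/Clang builtin `__builtin___clear_cache`. A sketch, assuming `buf` points into an executable mapping (the mmap setup is omitted):

```c
#include <string.h>
#include <stddef.h>

/* Sketch of publishing JIT-compiled code. On AArch64 the builtin
 * expands to the dc cvau / dsb / ic ivau / dsb / isb sequence
 * described above; on x86 it is a no-op, since instruction fetch is
 * coherent with the data caches there. */
void publish_code(void *buf, const void *code, size_t len) {
    memcpy(buf, code, len);  /* new instructions land in the D-cache */
    __builtin___clear_cache((char *)buf, (char *)buf + len);
    /* only now is it safe to branch to buf */
}
```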

Memory-mapped I/O registers should be marked as "device memory" in the page tables set up by the kernel, which automatically exempts them from caching and reordering, so you do not need explicit cache flushes or barriers to access them; ordinary loads and stores are enough. Some systems might need explicit cache management for DMA.
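The resulting access pattern looks like this. The register name and layout are hypothetical, and a plain static variable stands in for the device register so the sketch runs anywhere; on real hardware the address would come from the device documentation and be mapped as device memory by the kernel.

```c
#include <stdint.h>

static uint32_t fake_uart_tx;  /* stand-in for a hardware register */

/* volatile guarantees exactly one 32-bit store per assignment, in
 * program order relative to other volatile accesses; the device-memory
 * page attributes do the rest (no caching, no reordering). */
#define UART_TX (*(volatile uint32_t *)&fake_uart_tx)

void uart_putc(char c) {
    UART_TX = (uint32_t)c;  /* never merged, reordered, or elided */
}
```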

C/C++ compilers do not emit any cache maintenance instructions (nor barriers) for volatile reads and writes. All you get is the usual guarantee that a volatile read/write results in the execution of exactly one load/store instruction. As mentioned above, this is sufficient for memory-mapped I/O access, which is the main legitimate use for volatile in C/C++. If you are doing something else for which cache maintenance is actually needed, then you have to insert those instructions yourself. For the JIT situation described above, gcc/clang provide __builtin___clear_cache().

Other languages like C#/Java have different semantics for volatile, more like C _Atomic or C++ std::atomic. In this case you would get memory barriers but still no cache maintenance.

Nate Eldredge
  • 1
    x86 doesn't have *unified* instruction / data caches either, but unlike ARM they have to be *coherent* with each other (and with the pipeline.) So it does have to behave as if there was a unified L1, and as if instruction prefetch into the pipeline was negligible or flushed on jumps. (In practice flushing on jumps would be terrible, [so modern x86 chooses do something stronger](https://stackoverflow.com/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code) that allows higher performance.) – Peter Cordes Jul 10 '23 at 18:41