13

I recently learned about the row hammer attack. To perform this attack, the attacker's code needs to flush a specific set of addresses out of the CPU's complete cache hierarchy, repeatedly.

My question is: why is CLFLUSH necessary in x86? What are the reasons for ever using this instruction, if all L* caches act transparently (i.e., no explicit cache invalidation is needed)? Besides that: isn't the CPU free to speculate on memory access patterns, and thereby ignore the instruction altogether?

Peter Cordes
Martijn

1 Answer

9

I think the main use-case is non-volatile DIMMs, especially Intel's Optane DC PM. That memory is normally mapped WB-cacheable, so it requires explicit flushes (or movnt stores) to make sure data actually reaches the non-volatile storage.

(But clflush was introduced at the same time as SSE2, back in Pentium 4 days. I don't know what the idea was there; possibly explicit cache control for performance reasons, like the opposite of prefetch.)

Skylake introduced the weakly-ordered, higher-performance CLFLUSHOPT because it's useful for non-volatile storage hooked up directly to the memory hierarchy. Flushing the cache makes sure data is written out to the actual memory, not left dirty in the CPU's caches.
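As a rough illustration, here's a minimal sketch of such a flush in C using the compiler intrinsics for these instructions. It assumes 64-byte cache lines, a buffer that already sits in a persistent-memory mapping (e.g. a DAX-mapped file), and a compiler flag like -mclflushopt; the function name and parameters are placeholders for this example, not any library's API:

```c
#include <immintrin.h>   /* _mm_clflushopt, _mm_clwb, _mm_sfence */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64    /* assumption: 64-byte lines, true on current x86 CPUs */

/* Flush [buf, buf+len) so any dirty lines are written out toward the
   (persistent) memory controller.  clflushopt is weakly ordered, so a
   single sfence at the end is enough; plain clflush would be ordered
   more strictly (and more slowly) on its own. */
static void flush_range_for_persistence(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflushopt((void *)p);   /* or _mm_clwb to keep the line cached */
    _mm_sfence();                    /* order the flushes before later stores */
}
```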

See also this SuperUser answer for some links and background on Optane DC PM (Persistent Memory). It's non-volatile storage in physical address-space, not just in virtual address space with software tricks.

Dan Luu's article on clwb and pcommit is interesting: it covers the benefits of taking the OS out of the path for access to storage, and details Intel's plans at that point for clflush / clwb and their memory-ordering semantics. It was written while Intel was still planning to require an instruction called pcommit (persistent commit) as part of the persistence sequence, but Intel later decided to remove that instruction: Deprecating the PCOMMIT Instruction (from Intel) has some interesting info about why, and about how things work under the hood.


It potentially also matters for non-cache-coherent DMA to devices, if anything can do that in x86. (But x86 has always had cache-coherent DMA, since the first x86 CPUs with caches, to avoid breaking existing software.)

Apparently it's not possible to map MMIO / PCIe device memory regions as write-back (WB) cacheable (see: how to do mmap for cacheable PCIe BAR). Maybe the P4 architects were considering that future possibility when they introduced clflush.

In that previous link, Dr. Bandwidth mentions a partial workaround that actually involves needing CLFLUSH to maintain correctness:

map the MMIO range twice -- once for store operations from the processor to the FPGA using the Write-Combining (WC) memory type, and once for reads from the processor to the FPGA using the Write Protect (WP) or Write Through (WT) types. You will need to maintain coherence manually by using CLFLUSH on cache lines in the "read only" region when you write to the alias of that line in the "write only" region.

So it is possible to create a situation where you might need clflush, other than for NV-DIMM.
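For illustration, here is a hedged sketch of that double-mapping scheme in C. The two aliases of the same device page (wc_base mapped WC for stores, wt_base mapped WT/WP for loads) are assumed to have been set up elsewhere by a driver with the right PAT attributes; the function and variable names are made up for this example:

```c
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: two virtual mappings of the same FPGA BAR page.
   wc_base - mapped write-combining (WC), used only for stores
   wt_base - mapped write-through / write-protect (WT/WP), used only for loads */
static void write_then_read_coherently(volatile uint32_t *wc_base,
                                       volatile uint32_t *wt_base,
                                       size_t word_index, uint32_t value)
{
    wc_base[word_index] = value;     /* store through the WC alias */
    _mm_sfence();                    /* drain the write-combining buffer to the device */

    /* The load-side alias may still hold a stale copy of this line in cache;
       CLFLUSH it so the next load refetches from the device. */
    _mm_clflush((const void *)&wt_base[word_index]);

    uint32_t readback = wt_base[word_index];   /* now sees the device's current value */
    (void)readback;
}
```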

Peter Cordes
  • I think non-volatile main memory could not have been the reason for introducing CLFLUSH, because the instruction was first introduced in the NetBurst microarchitecture many years before even NVDIMM prototypes existed. Like you said in the last sentence, CLFLUSH was originally introduced because IO devices may perform non-coherent memory writes to a memory region with a cacheable memory type, so the processor has to flush any lines that may possibly be cached in order to access the most recent data. INVD and WBINVD are not useful in this scenario because they both result in data loss... – Hadi Brais Jul 29 '20 at 18:21
  • ...Before CLFLUSH, such memory regions had to be given an uncacheable memory type, which may hurt performance. PCIe supports both coherent and non-coherent IO memory accesses, so even on modern systems IO memory access can be non-coherent. Many years later, when NVDIMMs were first introduced, CLFLUSH was also used for ensuring memory persistence, especially on systems that don't support CLFLUSHOPT or CLWB. – Hadi Brais Jul 29 '20 at 18:21
  • Oh, WBINVD could be used instead, but is obviously much less efficient. – Hadi Brais Jul 29 '20 at 18:26
  • @HadiBrais: Thanks, I'd wondered how old `clflush` was, didn't realize it was *that* old. Was DMA still non-coherent at that point? I'm not clear on the timeline of if/when x86 had non-coherent DMA and if so what was used for flushing memory before `clflush`. `wbinvd` seems like it would be too expensive to be practical (even on single-core CPUs), and `movnt` stores weren't guaranteed to evict until later (like P-M or so IIRC). – Peter Cordes Jul 29 '20 at 18:50
  • 1
    DMA has always supported cache coherency on all Intel processors with caches since the 80386. – Hadi Brais Jul 29 '20 at 19:09