For x86-64 architecture, is there an instruction that can load data at a given memory address to the cache? Similarly, is there an instruction that can evict a cache line given a memory address corresponding to that cache line (or something like a cache line identifier)?
-
Not sure about per-line, but you can flush the whole cache: http://x86.renejeschke.de/html/file_module_x86_id_325.html Also, see: http://stackoverflow.com/questions/1756825/how-can-i-do-a-cpu-cache-flush-in-x86-windows – Ricky Mutschlechner Apr 12 '16 at 04:02
1 Answer
Prefetch data into cache (without loading it into a register):
PREFETCHT0 [address]
PREFETCHT1 [address]
PREFETCHT2 [address]
intrinsic: void _mm_prefetch (char const* p, int hint)
See the instruction-set reference manual and other guides for what the different nearness hints mean. (Other links are in the x86 tag wiki.)
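As a rough sketch of where the intrinsic might help: an indirect access pattern like `data[idx[i]]` is hard for hardware prefetchers to predict, so software prefetch a few iterations ahead is a plausible candidate. The function name `sum_indirect` and the `PREFETCH_DISTANCE` value below are hypothetical and chosen for illustration only; the right distance (if prefetching helps at all) has to be found by measurement on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * This value is an assumption, not a recommendation; measure it. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n), prefetching the element that will be
 * needed PREFETCH_DISTANCE iterations later. */
long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    assert(n == 0 || (data != NULL && idx != NULL));

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only read idx[] in bounds. The prefetch itself never faults,
         * but reading idx past the end would be undefined behaviour. */
        if (i + PREFETCH_DISTANCE < n) {
            /* _MM_HINT_T0 corresponds to the PREFETCHT0 instruction. */
            _mm_prefetch((const char *)&data[idx[i + PREFETCH_DISTANCE]],
                         _MM_HINT_T0);
        }
        total += data[idx[i]];
    }
    return total;
}
```

Calling `sum_indirect(data, idx, n)` returns the same result with or without the prefetch line; the prefetch is only a hint and changes timing, not semantics.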
The famous What Every Programmer Should Know About Memory article was written when the Pentium 4 was current. Current CPUs have smarter hardware prefetchers, and hyperthreading is useful for much more than just running prefetch threads; prefetch threads are a dead idea. Other than that, it's an excellent article about caching. I wrote an SO answer with a modern review of what's changed and what's still relevant in Ulrich Drepper's original. Search for other SO posts to decide when software prefetch is actually worth using.
Don't overdo software prefetch on Intel Ivy Bridge. That specific microarchitecture has a performance bug and can only retire one prefetch instruction per 43 clocks.
Flush the cache line containing a given address:
clflush [address]
clflushopt [address] ; Newer CPUs only. Weakly ordered, with higher throughput.
intrinsics: void _mm_clflush (void const* p) and void _mm_clflushopt (void const* p)
There was a recent question about its performance.
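For flushing more than one line, a minimal sketch might loop over a buffer one cache line at a time. It uses `_mm_clflush` (the intrinsic for `clflush`) because that is SSE2 and so available on every x86-64 compiler; swapping in `_mm_clflushopt` would need a CPU and compiler that support it. The name `flush_range`, its line-count return value, and the 64-byte `CACHE_LINE_SIZE` are assumptions for illustration; robust code would query the line size via CPUID rather than hard-code it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */

/* Assumption: 64-byte cache lines, which is typical for current x86-64
 * CPUs but not guaranteed by the architecture. */
#define CACHE_LINE_SIZE 64

/* Flush every cache line that overlaps [buf, buf + len).
 * Returns the number of lines flushed (hypothetical helper, for checking). */
size_t flush_range(const void *buf, size_t len)
{
    assert(len == 0 || buf != NULL);
    if (len == 0)
        return 0;

    /* Round the start address down to a cache-line boundary. */
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)buf + len;

    size_t lines = 0;
    for (; p < end; p += CACHE_LINE_SIZE) {
        _mm_clflush((const void *)p);  /* write back (if dirty) and evict */
        lines++;
    }

    /* Fence so the flushes are ordered before later loads and stores. */
    _mm_mfence();
    return lines;
}
```

Flushing writes dirty data back to memory before evicting, so the buffer's contents are unchanged afterwards; only the next access to it gets slower.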

-
P.C. - what do you do? Do you work for Intel, or in serious HPC optimisation? Just curious. You need to consider putting stuff on a site like Agner Fog, perhaps with links to S.O. answers! – Brett Hale Apr 13 '16 at 12:22
-
@BrettHale: I'm actually unemployed ATM. This is a hobby I enjoy :P (and why I have so much time to write SO answers). If anyone would like to employ me to optimize the crap out of stuff, that would be cool. – Peter Cordes Apr 13 '16 at 12:33
-
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109051/discussion-between-peter-cordes-and-brett-hale). – Peter Cordes Apr 13 '16 at 13:20
-
`_mm_clflushopt()` doesn't seem to be supported by MSVC 2017 toolset v141. – Serge Rogatch Aug 06 '17 at 09:37
-
@SergeRogatch: That's unfortunate. Do they provide any alternate names for `clflushopt` intrinsics? Intel only documents [`_mm_clflushopt`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5452,4067,4729,94,76,270,3622,3874,3984,2679,1902,1234,3603&text=clflusho) – Peter Cordes Aug 06 '17 at 09:40
-
@PeterCordes, it seems not. Also, their x64 compiler doesn't support inline assembly, so I'm considering linking an assembly file. That would incur function call overhead, so I doubt whether this would be faster than `_mm_clflush` (which the compiler does provide). – Serge Rogatch Aug 06 '17 at 09:52
-
@SergeRogatch: If you need to flush a big buffer, you could write a whole function in asm that includes a loop. The perf diff can be [very significant for larger buffers](https://stackoverflow.com/questions/44428520/opposite-of-cache-prefetch-hint). Or of course you could just use a compiler with more up-to-date support for Intel's intrinsics, either for that one file or for your whole project. – Peter Cordes Aug 06 '17 at 10:11