52

I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in CPU cache), preferably a basic C implementation or Win32 call.

Is there a known way to do this with a system call or even something as sneaky as doing say a large memcpy?

Intel i686 platform (P4 and up is okay as well).

Ciro Santilli OurBigBook.com
user183135

4 Answers

56

Fortunately, there is more than one way to explicitly flush the caches.

The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.

Additionally, there is the "invd" instruction, which invalidates the caches without writing modified lines back to main memory. Any modified data is discarded, which breaks coherency between main memory and the caches, so you have to deal with that yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked as WC (write combining) instead of WB (write-back). The memory-mapped region of the graphics card is a good candidate, or you can mark a region as WC yourself via the MTRRs.

You can find some resources about benchmarking short routines at Test programs for measuring clock cycles and performance monitoring.

Brandon
Gunther Piez
  • 1
    Ohh, I stand corrected. Neat, I didn't know about this instruction. – Falaina Nov 18 '09 at 16:37
  • 2
    The wbinvd instruction takes on the order of 2000-5000 clock cycles to complete! Most instructions take 2-5, on average. – unixman83 Sep 15 '11 at 12:53
  • Does `wbinvd` inside virtual8086 mode (e.g. a DOS program under 32-bit Windows) actually affect the host CPU? `cli` gets virtualized like other privileged instructions. (And BTW, `invd` is more than just "not really recommended", unless that's understatement for humour. You *must not* use `invd` except for cases like leaving cache-as-RAM mode; an interrupt handler could have just dirtied cache a couple cycles before you execute it on this or another core, causing it to corrupt the OS's state by discarding that store.) – Peter Cordes May 11 '19 at 03:07
9

There are x86 assembly instructions that force the CPU to flush specific cache lines (such as CLFLUSH), but they are pretty obscure. CLFLUSH in particular flushes only the cache line containing a chosen address, but it does so from all levels of cache (L1, L2, L3).
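As a rough sketch of the CLFLUSH approach, the function below walks a buffer and flushes each cache line through the `_mm_clflush` intrinsic from `<emmintrin.h>`. The name `flush_range` and the 64-byte line size are assumptions for illustration; 64 bytes is typical on x86 but not guaranteed. On targets without SSE2 the function does nothing and returns 0.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

/* Assumed cache line size; typical for x86, but not guaranteed. */
#define CACHE_LINE_SIZE 64

/* Hypothetical helper: flushes every cache line that overlaps
   [ptr, ptr + len). Returns the number of CLFLUSH instructions issued,
   or 0 when CLFLUSH is not available on the build target. */
size_t flush_range(const void *ptr, size_t len)
{
#if HAVE_CLFLUSH
    if (len == 0)
        return 0;

    /* Round the start address down to a cache line boundary. */
    uintptr_t start = (uintptr_t)ptr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)ptr + len;
    size_t lines = 0;

    _mm_mfence();                      /* order earlier stores before the flushes */
    for (uintptr_t p = start; p < end; p += CACHE_LINE_SIZE) {
        _mm_clflush((const void *)p);  /* write back and invalidate one line */
        lines++;
    }
    _mm_mfence();                      /* wait for the flushes before continuing */
    return lines;
#else
    (void)ptr;
    (void)len;
    return 0;
#endif
}
```

This only evicts the data you point it at, so you would call it on each buffer the benchmarked routine touches, outside the timed region. It does nothing for code or for other data that happens to be cached.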

something as sneaky as doing say a large memcpy?

Yes, this is the simplest approach, and will make sure that the CPU flushes all levels of cache. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
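A minimal sketch of the large-memcpy idea in portable C follows. The name `thrash_cache` and the 64-byte read stride are assumptions for illustration, and the buffer size is left to the caller. Eviction is likely, not guaranteed, because the hardware's replacement policy decides what actually leaves the cache.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Volatile sink so the compiler is discouraged from removing the copy. */
static volatile unsigned char thrash_sink;

/* Hypothetical helper: copies `bytes` bytes between two scratch buffers so
   that previously cached data is likely evicted.
   Returns 0 on success, -1 on a zero size or allocation failure. */
int thrash_cache(size_t bytes)
{
    if (bytes == 0)
        return -1;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }

    memset(src, 0xA5, bytes);   /* touch every page so it is really backed */
    memcpy(dst, src, bytes);    /* the large copy that pushes old data out */

    /* Read one byte per assumed 64-byte line and publish the result. */
    unsigned char sum = 0;
    for (size_t i = 0; i < bytes; i += 64)
        sum ^= dst[i];
    thrash_sink = sum;

    free(src);
    free(dst);
    return 0;
}
```

You would call something like `thrash_cache(64u * 1024 * 1024)` between iterations, with the size chosen to be several times your last-level cache (the 64 MiB figure is an assumption, not a measured value), and start the timer only after it returns.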

Marco Bonelli
intgr
  • 1
    "will make sure that the CPU flushes all levels of cache." Not true, as I stated, modern commercial cpus, especially when abstracted by an operating system, can (and probably do) have very complicated caching strategies. – marr75 Nov 18 '09 at 15:48
  • 5
    I believe you are confusing the CPU cache with other OS-level caches. The OS has basically no say in what the CPU will cache or not cache, because these decisions need to happen so quickly, there is no time for kernel interrupts or anything of the like. CPU cache is implemented purely in silicon. – intgr Nov 18 '09 at 15:56
  • 1
    A context switch will indeed let other processes run and thereby pollute the cache. But this is normal part of OS behavior -- it will take place with or without the benchmark, so it makes sense to include this in your timings anyway. – intgr Nov 18 '09 at 16:00
  • 6
    The CLFLUSH instruction does not flush only the L1 cache. From the Intel x86-64 reference manual: "The CLFLUSH (flush cache line) instruction writes and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain." – Michael Boyer May 23 '14 at 01:06
2

There is unfortunately no way to explicitly flush the cache. A few of your options are:

1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.

2.) Enable Cache Disable in the x86 Control Registers and benchmark that. This will probably disable the instruction cache also, which may not be what you want.

3.) Implement the portion of your code you're benchmarking (if possible) using non-temporal instructions. These are just hints to the processor about cache usage, though; it's still free to do what it wants.

1 is probably the easiest and sufficient for your purposes.

Edit: Oops, I stand corrected: there is an instruction to invalidate the x86 cache; see drhirsch's answer.

Falaina
  • 2
    Your claim that there is no instruction for cache flushing is wrong. And rewriting a routine using non temporal instructions for benchmarking is nonsense. If the data the routine is using fits in the caches, it would run way slower during the benchmarking, making the measurements worthless. – Gunther Piez Nov 18 '09 at 16:45
  • There is no way to explicitly flush the cache from windows. You are denied direct access to the hardware... there are non-portable assembly instructions that can do it. – marr75 Nov 18 '09 at 16:48
  • 3
    You can easily do it in Windows 95, 98, and ME. And even on the modern Windows variants you can implement it in ring 0 using a driver. – Gunther Piez Nov 18 '09 at 16:51
  • @drhirsch While I do stand corrected on the instruction for flushing the cache (thanks!), I disagree with your assessment of the use of non-temporal instructions. If he did the initial data loads for his benchmark using non-temporal instructions it isn't that much different from running with an empty cache and would be a sufficient way to simulate cold cache misses (though, I admit not nearly as correct as using the flush instruction!) – Falaina Nov 18 '09 at 17:53
  • 2
    I apologize, I was a bit harsh. But you can't modify a program using non-temporal instructions to simulate cold cache behavior for benchmarking. 1) You would need to unroll exactly one loop and make it non-temporal, thus changing the control flow and the usage of the instruction cache. 2) If the data resides in cache before the start, even non-temporal instructions will load the data from the cache, and you will get a warm cache result. 3) If not, the second iteration will need to fetch the data from memory again, and you will get a result with doubled memory latencies. – Gunther Piez Nov 18 '09 at 18:35
  • 1
    x86 doesn't have general-purpose non-temporal *loads*. SSE4 `movntdqa` loads are only special when reading from WC memory, not normal write-back (WB) memory regions. (The manual says the NT hint may be ignored; that is the case on all current implementations except for reading from WC memory, e.g. for copying from video RAM to main memory.) – Peter Cordes May 11 '19 at 03:11
1

The x86 instruction WBINVD writes back and invalidates all caches. It is described as:

Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated.

Importantly, the instruction can only be executed in ring 0, i.e. by the operating system kernel, so userland programs can't simply use it. On Linux, you can write a kernel module that executes the instruction on demand. Actually, someone already wrote such a kernel module: https://github.com/batmac/wbinvd

Luckily, the kernel module's code is really tiny, so you can actually check it before loading code from strangers on the internet into your kernel. You can use that module (and trigger executing the WBINVD instruction) by reading /proc/wbinvd, for example via cat /proc/wbinvd.
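If you want to trigger the flush from inside a C benchmark instead of shelling out to cat, a sketch like the one below should do the same thing, since it just opens and reads the file. The function name `trigger_wbinvd` is made up for illustration, and it assumes the module is loaded and exposes /proc/wbinvd as described above.

```c
#include <stdio.h>

/* Hypothetical helper: reads the given file to end-of-file, which is what
   `cat /proc/wbinvd` does. Returns 0 if the file could be opened and read,
   -1 otherwise (for example when the module is not loaded). */
int trigger_wbinvd(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char buf[64];
    while (fread(buf, 1, sizeof buf, f) > 0) {
        /* The contents are not needed; the read itself is the trigger. */
    }

    int failed = ferror(f);
    fclose(f);
    return failed ? -1 : 0;
}
```

Calling `trigger_wbinvd("/proc/wbinvd")` before each timed run, and checking the return value, would at least tell you when the module is missing.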

However, I found that this instruction (or at least this kernel module) is really slow. On my i7-6700HQ I measured it to take 750 µs! This number seems really high to me, so I might have made a mistake measuring this -- please keep that in mind! The documentation for that instruction just says:

The amount of time or cycles for WBINVD to complete will vary due to size and other factors of different cache hierarchies.

Lukas Kalbertodt
  • Note: I know that this question is asking about Windows. However, it is linked from many places that are not talking about a specific OS, so I thought mentioning the kernel module makes sense. – Lukas Kalbertodt May 10 '19 at 18:29
  • Hi, I was wondering if you have also checked whether this kernel module invalidates the L1 and L2 caches of all the cores? As the Intel documentation says, non-shared caches may not be written back or invalidated. Basically that figure shows that only the private L1 and L2 of the executing core and the shared L3 will be written back and invalidated, but other cores' L1 and L2 won't. However, when I tested this kernel module, I observed that it invalidates the L1 and L2 of other cores as well. – Ana Khorguani Apr 06 '20 at 09:57
  • I was wondering if there is a loop calling the wbinvd instruction on each core? I'm not sure how to check that. Otherwise I am confused about how this module's wbinvd does what is more or less not provided by the instruction itself. – Ana Khorguani Apr 06 '20 at 09:57
  • @AnaKhorguani I don't know which caches are flushed exactly, sorry. I assumed all caches (including L1 and L2) are flushed, but I am not sure. And no idea about your core question either, sorry! – Lukas Kalbertodt Apr 06 '20 at 10:26
  • ok, thanks anyway. In the code there is a function call wbinvd_on_all_cpus. I was not able to find the implementation itself, but I assume it calls wbinvd for all the cores, though I might check with the module author himself then :) – Ana Khorguani Apr 06 '20 at 10:33