
For some precise measurements, I would like to invalidate/flush all caches up to RAM (main memory) from the command line, so that the flush itself does not count towards the measured running time of the main program. I have found the following (The first and last from here):

1. echo 3 > /proc/sys/vm/drop_caches

and I could build a (pre-executed) program with the following

2. #include <asm/cachectl.h>
int cacheflush(char *addr, int nbytes, int cache);

or I could finally do a

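3. #include <stdlib.h>
 int main() {
     const int size = 20*1024*1024; // Allocate 20M. Set much larger than L2
     char *c = (char *)malloc(size);
     if (!c) return 1;              // allocation failed
     for (int i = 0; i < 0xffff; i++)
       for (int j = 0; j < size; j++)
         c[j] = (char)((unsigned)i * (unsigned)j); // unsigned avoids signed overflow
     free(c);
 }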

My question is: for what I need to do, which version is best, and if it is #2, what is the address I should be giving it as a start address? My uname -a is Linux 3.2.0-33-generic #52-Ubuntu SMP Thu Oct 18 16:19:45 UTC 2012 i686

Dervin Thunk

1 Answer


You're running on an operating system that will do other things behind your back. The operating system will handle interrupts, run various daemons in the background and do various maintenance tasks and potentially move your running process to a different cpu, etc.

Invalidating caches is the least of your worries and if your measurements have to be this accurate, you need to reevaluate the environment where you run your test. Even if you manage to get everything the operating system does under control (which basically means making your tested code part of the operating system), you still need to consider TLB behavior and branch prediction buffers (which will affect your performance more than caches), get control over SMM (which you typically can't unless you have control over your BIOS) and understand how the clocks you use for measuring really behave (I'd guess that a temperature difference of 10 degrees will affect your measurement more than having a clean cache).

In other words - forget it. A typical way to measure things realistically is to run it "enough" times and take an average (or minimum or maximum or median, depending on what you want to prove).
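A minimal sketch of that "run it enough times" approach, assuming a POSIX system where `clock_gettime(CLOCK_MONOTONIC, ...)` is available. The names `timing_stats`, `now_ns`, `summarize` and `measure` are made up for this illustration; they are not from any library.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Summary of repeated timings, all in nanoseconds (hypothetical helper type). */
struct timing_stats {
    uint64_t min, median, max;
    double mean;
};

/* Read the monotonic clock and return its value in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place and compute min, median, max and mean. */
struct timing_stats summarize(uint64_t *samples, size_t n)
{
    struct timing_stats s;
    double sum = 0.0;

    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    s.min = samples[0];
    s.max = samples[n - 1];
    s.median = samples[n / 2];   /* upper median when n is even */
    s.mean = sum / (double)n;
    return s;
}

/* Time work(arg) `runs` times; the caller provides room for `runs` samples. */
struct timing_stats measure(void (*work)(void *), void *arg,
                            uint64_t *samples, size_t runs)
{
    for (size_t i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        work(arg);
        samples[i] = now_ns() - t0;
    }
    return summarize(samples, runs);
}
```

Which of the four numbers to report depends on what you want to show: the minimum is the least disturbed run, while the median is less sensitive to a few outliers than the mean.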

To add more: Your method number 1 flushes the filesystem caches and has nothing to do with data caches on the CPU. Number 2 I have no idea about; my Linux flavour doesn't have it. Number 3 might work if your CPU had perfect cache associativity, which it doesn't, and you would also have to make sure that the physical pages allocated by the operating system touch every possible cache line, which you can't. You'd also have to make sure that you either execute it on the same CPU your test will run on, or on all CPUs, and that nothing gets scheduled to run in between. Since you want to run this from the command line, your shell will stomp all over the caches long before your program runs (and the exec system call and filesystem operations won't help).

The only way to reliably clear caches on your architecture is the wbinvd instruction, which you're not allowed to execute because you're not the kernel and are not supposed to mess around with caches.

Art
  • Sorry, @Art, that's not an answer to my question. – Dervin Thunk Nov 20 '12 at 13:23
  • @DervinThunk Might not be the answer you want to hear, but it's the answer you need to hear. Your question reveals assumptions that simply cannot be met on any modern operating system. I expanded the answer to address the methods you mention. The conclusion remains - you can't do what you want to do and even if you could it wouldn't give you the result you expect. – Art Nov 20 '12 at 13:47
  • There is a `CLFLUSH` instruction that can be used to flush individual cache lines (which can be used from user-space); except you'd need to do this instruction 16384 times per MiB of memory that might (or might not) be in a CPU's cache, and a CPU can (speculatively) re-fetch data into the cache immediately after you've flushed it. – Brendan Nov 20 '12 at 15:10
  • To sum it up from a different angle... Assume there was a utility that would flush all L1, L2 and higher caches. The way CPU caches work, and given that you're on a modern operating system with many other daemons/services running, by the time you got your command prompt back after running the command, your caches would all be full again anyway... – twalberg Nov 20 '12 at 15:14
  • @Brendan That's a good point. I forgot about speculative execution. That's another wrench into the machinery here. With any Intel cpu as old as Core 2 or more modern or anything beyond Opteron from AMD, speculative execution means you can't even flush the TLB properly without having it refetched before the flush finishes (this is a massive PITA in memory management), much less any normal cache. – Art Nov 20 '12 at 23:32
  • @Brendan I don't mind the caches being full, I just don't want them to be full with my program's data. I need all my data to produce a cache miss the first time, that's all. – Dervin Thunk Nov 22 '12 at 13:28
  • @DervinThunk Why don't you worry about TLBs instead since a TLB miss/hit is likely to have larger impact on the performance than a cache miss/hit? How about the data cache lines that more modern CPUs use for walking page table entries when resolving TLB misses? What about branch prediction buffers? – Art Nov 22 '12 at 13:40
  • @Art, I understand your concerns, but gee, give me the benefit of the doubt. I know what I want, and it's not what you're giving me. Cache oblivious algorithms have an assumption of tall caches, where the size of the memory is at least B^2, B is the size of the line in words. I'm trying to measure the asymptotic behavior of memory transfers, and do so empirically, to see if I'm getting lower-upper bounds correctly. I want to be systematic, and scientifically careful, by making sure I'm starting with a fresh cache in each trial. I don't care about performance, I care about precision. – Dervin Thunk Nov 22 '12 at 14:27
  • Then your question wasn't very well phrased. Just call the CLFLUSH instruction on each cache line of the memory region with your data, just before the measurement (not from the command line as in the question). Speculative execution will most likely start fetching the memory into the cache before you start your timer, but you could get close. The noise from other sources will be greater than the time to fetch stuff into the cache, but that's the best you can do. – Art Nov 22 '12 at 15:14
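A rough sketch of the per-line CLFLUSH loop suggested in the comments above, assuming an x86 CPU with SSE2 and a compiler that provides `<emmintrin.h>`. The 64-byte line size is an assumption (it matches the 16384 flushes per MiB mentioned above, since 1 MiB / 64 B = 16384), and `flush_region` is a made-up name. As already noted, the CPU may speculatively refetch lines right after they are flushed, so treat this as best effort.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, common on x86 but not guaranteed. */
#define CACHE_LINE 64

/*
 * Flush every cache line that overlaps [buf, buf + len).
 * Returns the number of CLFLUSH instructions issued.
 */
size_t flush_region(const void *buf, size_t len)
{
    /* The alignment mask below only works for a power-of-two line size. */
    assert((CACHE_LINE & (CACHE_LINE - 1)) == 0);

    if (len == 0)
        return 0;

    /* Round the start address down to the beginning of its cache line. */
    uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
        _mm_clflush((const void *)p);
        lines++;
    }
    _mm_mfence();   /* order the flushes before later loads and stores */
    return lines;
}
```

You would call `flush_region(data, data_size)` inside the measured program, immediately before starting the timer, rather than from a separate command-line tool.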