
I am using Cachegrind to retrieve the number of cache misses of a static program compiled without libc (just a _start that calls my main function and an exit syscall in asm). The program is fully deterministic: the instructions and the memory references do not change from one run to another. The cache is fully associative with LRU as the replacement policy.
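
The entry point looks roughly like this (a minimal sketch of the setup, not the exact code; my_main stands for the benchmark's main function), built with -static -nostdlib:

/* minimal no-libc entry point for x86-64 Linux; my_main is illustrative */
void my_main(void);

void _start(void)
{
    my_main();
    /* exit(0): raw syscall, number 60 (SYS_exit) on x86-64, status in %edi */
    __asm__ volatile ("movl $60, %eax\n\t"
                      "xorl %edi, %edi\n\t"
                      "syscall");
}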

However, I noticed that the number of misses sometimes changes. More specifically, the number of misses stays the same until I run the program from a different directory:

 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./adpcm        
 ...
 ==31352== I   refs:      216,145,010
 ...
 ==31352== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31352== D1  misses:        240,004  (   150,000 rd   +     90,004 wr)
 ==31352== LLd misses:             31  (        11 rd   +         20 wr)

And if I execute the same command again and again, I keep getting the same results. But if I run the program from a different directory:

 % cd ..
 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./malardalen2/adpcm
 ...
 ==31531== I   refs:      216,145,010
 ...
 ==31531== D   refs:      130,481,003  (95,186,001 rd   + 35,295,002 wr)
 ==31531== D1  misses:        250,004  (   160,000 rd   +     90,004 wr)
 ==31531== LLd misses:             31  (        11 rd   +         20 wr)

And I get yet another result when running from yet another directory.

I've also done some experiments with a Pin tool and with this one I don't need to change the directory to get different values. But it seems that the set of possible values is very limited and is exactly the same as with Cachegrind.

My question is: what could be the sources of such differences?

My first guess is that my program is not aligned the same way in memory and, as a consequence, some variables that were stored in the same line in a previous run no longer are. That would also explain the limited number of combinations. But I thought that Cachegrind (and Pin) used the virtual addresses, and I would assume that the OS (Linux) always gives the same virtual addresses. Any other idea?
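
To make the hypothesis concrete: whether two variables share a line depends only on their addresses divided by the line size, so any shift in the layout can split or merge them (a trivial illustration, not part of the benchmark):

#include <stdint.h>

/* 1 if p and q fall into the same 64-byte cache line */
static int same_line(const void *p, const void *q)
{
    return ((uintptr_t)p / 64) == ((uintptr_t)q / 64);
}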

Edit: As you can guess from the LLd misses, the program only uses 31 different cache lines. Also, the cache can only contain 8 cache lines. So even on real hardware, the difference couldn't be explained by the cache already being populated from a previous run (at most, only 8 lines could stay in the L1).

Edit 2: Cachegrind's report is not based on actual cache misses (given by performance counters) but is the result of a simulation. Basically, it simulates the behavior of a cache in order to count the number of misses. Since the consequences of misses are only temporal, that's totally fine, and it allows changing the cache properties (size, associativity).
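
For intuition, here is a minimal sketch of what such a simulation does: a fully-associative, LRU-replacement cache fed with a trace of addresses (illustrative only, not Cachegrind's actual code; the sizes match the 8-line D1 used above):

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64
#define NUM_LINES 8                    /* 8 lines of 64 bytes, as in the D1 above */

static uint64_t line_tag[NUM_LINES];   /* line address currently held */
static uint64_t last_use[NUM_LINES];   /* LRU timestamp */
static int      valid[NUM_LINES];
static uint64_t now;

/* simulate one access; returns 1 on a miss, 0 on a hit */
static int cache_access(uint64_t addr)
{
    uint64_t tag = addr / LINE_SIZE;
    int i, victim = 0;

    for (i = 0; i < NUM_LINES; i++) {
        if (valid[i] && line_tag[i] == tag) {    /* hit: refresh the LRU stamp */
            last_use[i] = ++now;
            return 0;
        }
    }
    for (i = 0; i < NUM_LINES; i++) {            /* miss: take a free slot or evict the LRU line */
        if (!valid[i]) { victim = i; break; }
        if (last_use[i] < last_use[victim]) victim = i;
    }
    line_tag[victim] = tag;
    valid[victim]    = 1;
    last_use[victim] = ++now;
    return 1;
}

int main(void)
{
    /* hand-made trace; in reality the addresses come from the instrumented run */
    uint64_t trace[] = { 0x1000, 0x1004, 0x1040, 0x1000, 0x1200 };
    unsigned misses = 0;
    for (unsigned k = 0; k < sizeof trace / sizeof trace[0]; k++)
        misses += cache_access(trace[k]);
    printf("misses: %u\n", misses);              /* 3 misses, 2 hits for this trace */
    return 0;
}

The miss count depends only on the sequence of addresses fed in, which is why a shifted memory layout is enough to change it.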

Edit 3: The hardware I am using is an Intel Core i7 on a Linux 3.2 x86_64. The compile flags are -static and for some programs -nostdlib (IIRC, I'm not at home right now).

Maxime Chéramy
  • The cache was full of other data during the second run? – Some programmer dude Jun 28 '13 at 15:49
  • @JoachimPileborg I can go back to the previous directory and I will have another miss count or maybe the same as before. I think cachegrind starts with an empty stack to simulate the cache (and 31 is the number of cold misses). – Maxime Chéramy Jun 28 '13 at 15:51
  • 2
    When you change current working directory, you also change corresponding environment variable (and its length). Since a copy of all environment variables is usually stored just above the stack, you get different allocation for stack variables and different number of cache misses. (And shell could change some other variables besides "PWD"). – Evgeny Kluev Jul 02 '13 at 10:02
  • @EvgenyKluev That's true, yesterday I also tried to align the stack using a variable at the beginning of the program and `__attribute__((aligned(64)))`. It did not solve my problem though. – Maxime Chéramy Jul 02 '13 at 10:31
  • I think aligning the stack this way could only help if you have a fully associative cache. In case of a 1-way, 2-way or 4-way cache you could still get your stack variables and static variables aliased in different ways depending on the environment size. You could try to align by your cache's size instead of just 64 bytes... – Evgeny Kluev Jul 02 '13 at 10:58
  • I am only considering fully associative caches for now. Why would the static variables depend on the environment size? – Maxime Chéramy Jul 02 '13 at 11:09
  • No, static variables allocation, most likely, does not depend on the environment size, but stack variables allocation does. And they may be "moved" relative to static variables. But all this doesn't matter since your cache is fully associative. – Evgeny Kluev Jul 02 '13 at 11:14
  • You know the cache is shared by the whole system? How do you know that it isn't some kernel task like `kjournald`, `kswapd`, etc which is writing an *atime* for some directory you are in. Anything in the system could end up causing a cache hit/miss by evicting some random line. Maybe the current directory hits a *hash* table entry which happens to share the same line. – artless noise Jul 05 '13 at 01:58
  • Thanks, but I'm studying caches, don't you think I know that? :). Cachegrind (the way I'm using it) doesn't use the performance counters, it only catches the memory addresses (the virtual ones I think) and simulates the behavior of the caches with arrays in order to get the number of misses. That's also how you can change the size of the caches. – Maxime Chéramy Jul 05 '13 at 06:20
  • What hardware are you running this on? also what are your compile flags?... – TheCodeArtist Jul 07 '13 at 12:07
  • Performing [**hot-spot analysis**](http://c.learncodethehardway.org/book/ex41.html) using cachegrind/callgrind will give you a different perspective. Maybe the program is following a different code-path in each of the multiple runs? – TheCodeArtist Jul 07 '13 at 12:45
  • @TheCodeArtist see my last edits for your first questions. – Maxime Chéramy Jul 07 '13 at 12:47
  • @TheCodeArtist As already said in the question, the programs are fully deterministic. I have abandoned the use of Cachegrind for now, I decided to use the memory trace directly and do what cachegrind does myself. The trace is given by Gem5 and does not change between 2 runs if I stop the randomization of the stack address. – Maxime Chéramy Jul 07 '13 at 12:55
  • @Maxime Since you are puzzled about the varying cache-misses in each case, identifying where exactly they occur in your code will help you understand the behaviour. This can be quickly done using callgrind with annotated source as described in this [**link**](http://c.learncodethehardway.org/book/ex41.html). Once the few lines of code which generate the additional cache-misses (and spend significantly varying amounts of time) are identified, studying the assembly should lead us to understanding the behaviour. I would like to try this out as well. Can you share the src/link to the adpcm binary? – TheCodeArtist Jul 07 '13 at 14:30
  • ADPCM is here: http://www.mrtc.mdh.se/projects/wcet/wcet_bench/adpcm/adpcm.c I changed the main in order to call sys_exit (in asm). But I observe that with almost any program (with a small cache)... – Maxime Chéramy Jul 07 '13 at 16:46

2 Answers


Linux implements the "address space layout randomization" technique (http://en.wikipedia.org/wiki/Address_space_layout_randomization) for security reasons. You can deactivate this behavior like this:

echo -n "0" > /proc/sys/kernel/randomize_va_space
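
The write above needs root and changes the setting system-wide. A per-process alternative that should be available on most distributions (it ships with util-linux) is setarch's no-randomization switch:

setarch "$(uname -m)" -R ./a.out    # -R (--addr-no-randomize) disables ASLR for this run only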

You can test it with this example:

#include <stdio.h>
#include <stdint.h>

int main(void) {
   char a;
   /* print the stack address of 'a' (truncated to 32 bits, as in the runs below) */
   printf("%u\n", (unsigned int)(uintptr_t)&a);
   return 0;
}

With randomization disabled, you should always get the same value printed.

Before:

 % ./a.out
4006500239
 % ./a.out
819175583
 % ./a.out
2443759599
 % ./a.out
2432498159

After:

 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207
 % ./a.out
4294960207

That also explains the different number of cache misses, since two variables that were in the same line in one run can end up in two different lines in another.

Edit: Apparently this does not entirely solve the problem, but I think it was one of the reasons. I'll give the bounty to anyone who can help me resolve this issue.

Maxime Chéramy
  • If I run this example in cachegrind, I have different values (4278190991 for instance) and this value changes... And when I have a different value, I also have a different number of cache misses. – Maxime Chéramy Jul 01 '13 at 16:44
  • Does the kernel possibly randomize where the binary and/or shared libraries get loaded? The flavor of linux I use here doesn't, but I know some of them do. – Art Jul 04 '13 at 07:44
  • The program is compiled statically (so no shared library to load). The addresses of the global variables do not seem to change. – Maxime Chéramy Jul 04 '13 at 08:00
  • Make the program pause after startup. Check its memory layout with the `pmap` command. What changes on every execution? – Art Jul 04 '13 at 12:57
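
A sketch of that check (hypothetical session; it assumes the program pauses after startup, e.g. blocking on a read, so there is time to inspect it):

./a.out &       # start it in the background while it is paused
pmap -x $!      # extended memory map of that PID; compare across runs and directories
kill %1         # clean up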

It seems this is a known behavior in Valgrind.

I used the example above that prints the address of a stack variable, and I also disabled the layout randomization.

I ran the executable twice, getting the same results in both runs:

==15016== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15016== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15016== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15016== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15016== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out 

==15019== D   refs:       40,649  (28,565 rd   + 12,084 wr)
==15019== D1  misses:     11,465  ( 8,412 rd   +  3,053 wr)
==15019== LLd misses:      1,516  ( 1,052 rd   +    464 wr)
==15019== D1  miss rate:    28.2% (  29.4%     +   25.2%  )
==15019== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

According to the Cachegrind documentation (http://www.cs.washington.edu/education/courses/cse326/05wi/valgrind-doc/cg_main.html):

Another thing worth noting is that results are very sensitive. Changing the size of the valgrind.so file, the size of the program being profiled, or even the length of its name can perturb the results. Variations will be small, but don't expect perfectly repeatable results if your program changes at all. While these factors mean you shouldn't trust the results to be super-accurate, hopefully they should be close enough to be useful.

After reading this, I changed the file name and got the following:

villar@localhost ~ $ mv a.out a.out2345345345
villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out2345345345 

==15022== D   refs:       40,652  (28,567 rd   + 12,085 wr)
==15022== D1  misses:     10,737  ( 8,201 rd   +  2,536 wr)
==15022== LLd misses:      1,517  ( 1,054 rd   +    463 wr)
==15022== D1  miss rate:    26.4% (  28.7%     +   20.9%  )
==15022== LLd miss rate:     3.7% (   3.6%     +    3.8%  )

Changing the name back to "a.out" gave me exactly the same result as before.

Notice that changing the file name, or the path to it, will change the base of the stack! This may be the cause, given what Evgeny said in an earlier comment:

When you change current working directory, you also change corresponding environment variable (and its length). Since a copy of all environment variables is usually stored just above the stack, you get different allocation for stack variables and different number of cache misses. (And shell could change some other variables besides "PWD").
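
A quick way to observe this (assuming the address-printing example from the other answer is compiled as a.out): run it with environments of different lengths and watch the printed stack address move:

env -i ./a.out                          # run with an empty environment
env -i PAD=xxxxxxxxxxxxxxxx ./a.out     # a longer environment lowers the initial stack, so the address changes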

EDIT: Documentation also says:

Program start-up/shut-down calls a lot of functions that aren't interesting and just complicate the output. Would be nice to exclude these somehow.

The simulated cache may also be tracking the program's start-up and shut-down code, which could be the source of the variations.

Emilcasvi
  • +1 for the quote from the documentation and the effort to reproduce. However, once I had no change in the stack address and still got different results. But I was not able to reproduce it... – Maxime Chéramy Jul 08 '13 at 14:35
  • You may check each miss against the code line that produces it with cg_annotate (http://cs.swan.ac.uk/~csoliver/ok-sat-library/internet_html/doc/doc/Valgrind/3.7.0/html/cg-manual.html). Another option: valgrind simulates the cache by converting x86 instructions to micro-ops. Since your code is small enough and does not have any dependences, you could modify valgrind in order to trace and analyze the blocks of code and the addresses that are being referenced. You can find more info on how to do this at http://www.cs.washington.edu/education/courses/cse326/05wi/valgrind-doc/cg_techdocs.html. – Emilcasvi Jul 08 '13 at 15:27