
I'm trying to come up with an example program that would have a high cache-miss rate. I thought I could try accessing a matrix column by column, like so:

#include <stdlib.h>

int main(void)
{
    int i, j, k;

    int w = 1000;
    int h = 1000;

    int **block = malloc(w * sizeof(int*));
    for (i = 0; i < w; i++) {
        block[i] = malloc(h * sizeof(int));
    }

    /* write the matrix column by column: for each i, the inner loop
       touches a different malloc'd row on every iteration */
    for (k = 0; k < 10; k++) {
        for (i = 0; i < w; i++) {
            for (j = 0; j < h; j++) {
                block[j][i] = 0;
            }
        }
    }

    return 0;
}

When I compile this with the -O0 flag and run it with perf stat -r 5 -B -e cache-references,cache-misses ./a.out, it gives me:

 Performance counter stats for './a.out' (5 runs):

    715,463 cache-references                                      ( +-  0.42% )
    527,634 cache-misses          #   73.747 % of all cache refs  ( +-  2.53% )

0.112001160 seconds time elapsed                                  ( +-  1.58% )

which is good enough for my purposes. However, if I change the matrix size to 2000x2000, it gives:

 Performance counter stats for './a.out' (5 runs):

  6,364,995 cache-references                                      ( +-  2.32% )
  2,534,989 cache-misses          #   39.827 % of all cache refs  ( +-  0.02% )

0.461104903 seconds time elapsed                                  ( +-  0.92% )

and if I increase it even further to 3000x3000 I get:

 Performance counter stats for './a.out' (5 runs):

 59,204,028 cache-references                                      ( +-  1.36% )
  5,662,629 cache-misses          #    9.565 % of all cache refs  ( +-  0.11% )

1.116573625 seconds time elapsed                                  ( +-  0.32% )

which is strange, because I would expect the cache-miss rate to increase with the size. I need something that will be as platform-independent as possible. My computer architecture class was long ago, so any insight would be welcome.

Notes

I said I need something relatively platform-independent, but these are my specs nonetheless:

  • Intel® Core™ i5-2467M
  • 4 GiB RAM
  • 64-bit Ubuntu 12.04
  • Maybe [this question of mine](http://stackoverflow.com/questions/14543965/cpu-cache-critical-stride-test-giving-unexpected-results-based-on-access-type) could help? I posted a sample program that is meant to cause high cache misses. I haven't figured out the answer to my question yet, but the post might be useful for your purposes. – Andy Prowl Jan 31 '13 at 21:43
  • I wonder if `gcc` is `null`ing the `malloc` for `-O0`? Dump your matrix. I would up the optimisation level. If this is the case, it could be misleading your stats. – Alex Chamberlain Jan 31 '13 at 21:44
  • @AndyProwl I haven't seen that one, I will take a look – none Jan 31 '13 at 21:46
  • @AlexChamberlain I don't understand what I should do – none Jan 31 '13 at 21:49
  • Try your example with `-O3`. – Alex Chamberlain Jan 31 '13 at 21:59
  • @AlexChamberlain `-O2` and `-O3` both give me around `28%` – none Jan 31 '13 at 22:01
  • From Bjarne Stroustrup himself (paraphrased): "The canonical example of an asinine cache-missing piece of code is the linear traversal of a linked-list." You might try that. If the nodes are created somewhat non-linearly and on a heap that is already heavily segmented / used (and the list is large, of course), the expected rate of cache-misses is pretty much 100%. – Mikael Persson Jan 31 '13 at 22:12
  • @MikaelPersson: and if in doubt about the segmentation/fragmentation of your node allocations, you can always shuffle the list before traversing :-) – Steve Jessop Jan 31 '13 at 22:23
  • @MikaelPersson I think that's from the GoingNative conference. So I'm trying to push 10000 numbers into a forward list and then read them with an iterator at the moment, but it gives me only around `16%`. How can I make sure the elements are at random positions? – none Jan 31 '13 at 22:23
  • it depends on the organization of your caching unit. – thang Jan 31 '13 at 22:25
  • Interesting that the number of cache misses and the time are increasing roughly in proportion to the total memory you're using, whereas the number of cache references is ballooning beyond the expected `n^2` growth. What's with that, then? If I'm right that it's weird, then that's what's diluting your cache misses down to an unexpectedly low percentage. – Steve Jessop Jan 31 '13 at 22:31
  • @SteveJessop that's quite remarkable and I have no idea why. – none Jan 31 '13 at 22:47
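To make the linked-list idea from the comments concrete, here is a minimal sketch, assuming glibc's rand() (whose RAND_MAX covers the index range): allocate the nodes, shuffle an array of node pointers so that list-order neighbors sit at unrelated heap addresses, link them in the shuffled order, and traverse. The names are illustrative; this is a sketch, not a tested benchmark:

#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

int main(void)
{
    int n = 1000 * 1000;
    int i;
    long sum = 0;
    struct node *p;
    struct node **nodes = malloc(n * sizeof(struct node *));

    for (i = 0; i < n; i++) {
        nodes[i] = malloc(sizeof(struct node));
        nodes[i]->value = i;
    }

    /* Fisher-Yates shuffle of the pointer array, so consecutive list
       nodes end up at unrelated heap addresses (assumes RAND_MAX >= n,
       true with glibc) */
    for (i = n - 1; i > 0; i--) {
        int r = rand() % (i + 1);
        struct node *tmp = nodes[i];
        nodes[i] = nodes[r];
        nodes[r] = tmp;
    }

    /* link the nodes in shuffled order */
    for (i = 0; i < n - 1; i++) {
        nodes[i]->next = nodes[i + 1];
    }
    nodes[n - 1]->next = NULL;

    /* traversal: every ->next hop lands on an unrelated address */
    for (p = nodes[0]; p != NULL; p = p->next) {
        sum += p->value;
    }

    return (int)(sum & 1);  /* keep the traversal from being optimized away */
}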

2 Answers


Beware of automatic prefetch in modern CPUs; it can often detect strided accesses. Perhaps try a random access pattern, e.g.:

#include <stdlib.h>

int main(void)
{
    int i;

    int n = 1000 * 1000;

    int *block = malloc(n * sizeof(int));

    /* touch one random element per iteration; visiting only a tenth of
       the elements keeps whole-cache-line reuse unlikely */
    for (i = 0; i < n / 10; i++) {
         int ri = rand() % n;
         block[ri] = 0;
    }

    return 0;
}
Paul R
  • Your point that modern CPUs will detect "strided accesses" is valid. However note that your program does *not* have the same behavior as the original program. – Nik Bougalis Jan 31 '13 at 21:46
  • @Nik: no, the behaviour is not identical, but since the goal is apparently just to produce a high cache miss rate I'm assuming that this is good enough. If it still needs to zero the entire array then you'd need a more sophisticated pseudo-random access pattern. – Paul R Jan 31 '13 at 21:48
  • I don't disagree. Just "documenting" ;) – Nik Bougalis Jan 31 '13 at 21:49
  • OK, so it gives `30%`, `77%` and `75%` for matrices of size `1000`, `2000` and `3000` respectively. Execution time increases a lot as the matrix sizes get bigger. Can we come up with an example using a small `1000` matrix? – none Jan 31 '13 at 21:54
  • For a small matrix you'd need to get rid of the outer `k` loop, otherwise L2/L3 cache will frustrate your aims. (Answer now edited to remove the outer `k` loop.) – Paul R Jan 31 '13 at 21:56
  • I thought since we don't need to access all elements I could decrease the numbers in the nested for loops to some fixed value like `10`, `100` and `100` but it gives me `37%` for the `3000` matrix. – none Jan 31 '13 at 21:56
  • @gokcehan: the problem is that 1000x1000 ints basically fits in the L1 cache, which means (if there were no OS or anything), _no_ access pattern will have _any_ misses. You're getting misses primarily because other programs and the OS are also running. – Mooing Duck Jan 31 '13 at 22:03
  • @Mooing Duck, That's a heck of a processor you have –  Jan 31 '13 at 22:12
  • @MooingDuck So I have tried removing the outermost for loop (it was redundant anyway) and fixing the number of accesses (so as not to increase the execution time too much) while increasing the matrix size. Matrix sizes `1000-2000-3000-4000-5000-6000` give `41-72-63-53-46-43 %` respectively. It seems like the right approach, but what's with the peaking behavior? – none Jan 31 '13 at 22:16
  • @arrows: it was the only explanation that made sense to me. A quick check shows my L1 cache is nowhere close to that big. I guess it just detected strided accesses. – Mooing Duck Jan 31 '13 at 22:23
  • @gokcehan: I once got stellar results with allocating a 1GB array and writing data to random elements (counting sort). Have you tried that? (I have no idea about the peaking behavior) – Mooing Duck Jan 31 '13 at 22:25
  • @PaulR: `rand` can be slow; you might get better performance doing one `rand` call mod `w*h` and calculating the x and y from that. – Mooing Duck Jan 31 '13 at 22:27
  • Based on comments etc I've updated the answer now to make it slightly more efficient and to make the accesses more sparse (mitigates the effects of whole cache lines being read). – Paul R Jan 31 '13 at 22:32
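If, as discussed in the comments above, the whole array still needs to be zeroed, one possibility (a sketch of my own, not code from the answer) is to walk the indices with a fixed step that is coprime to n: every element is visited exactly once, and each hop jumps by hundreds of thousands of elements, which should be well beyond the prefetcher's reach:

#include <stdlib.h>

int main(void)
{
    int n = 1000 * 1000;
    /* 618033 is odd and not a multiple of 5, so it is coprime to
       n = 10^6; the walk below therefore visits every index in
       [0, n) exactly once, hopping by +618033 or -381967 elements
       each step */
    int step = 618033;
    int idx = 0;
    int i;

    int *block = malloc(n * sizeof(int));
    if (block == NULL) {
        return 1;
    }

    for (i = 0; i < n; i++) {
        block[idx] = 0;
        idx += step;
        if (idx >= n) {
            idx -= n;
        }
    }

    free(block);
    return 0;
}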

I'm not completely certain you can compare these programs or really guarantee anything, because it depends on how the OS allocates individual chunks of memory.

You should at least allocate ALL memory as a single block, then index into that block to get all the arrays (int* and int). That way you have a consistent starting point. You may want to pass the array size as an argument instead of recompiling each time.

You can also tweak it so that you allocate WAY more memory than you need and space each row (or column, the way you have written it) far apart, to guarantee that only one row (column) of the matrix will be loaded into cache at any one time. I.e. find out the size of your cache, and space each chunk at least that many bytes apart.

Note that you should really free your memory before exiting.

As already pointed out by others, randomizing your access pattern is a good idea.
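A minimal sketch of the single-block, spaced-out layout described above; the 256 KiB figure is a placeholder for your actual cache size (on Linux, see /sys/devices/system/cpu/cpu0/cache/index*/size), and note this allocates roughly 256 MiB:

#include <stdlib.h>

/* placeholder: substitute the size of the cache you want to defeat */
#define CACHE_BYTES (256 * 1024)

int main(void)
{
    int w = 1000;
    int h = 1000;
    int i, j;
    int *block;

    /* pad each row out to at least CACHE_BYTES, per the spacing
       suggestion above, so rows never share the cache */
    size_t row_stride = CACHE_BYTES / sizeof(int);
    if (row_stride < (size_t)h) {
        row_stride = h;
    }

    block = malloc((size_t)w * row_stride * sizeof(int));
    if (block == NULL) {
        return 1;
    }

    /* same column-by-column access pattern as the question's code */
    for (i = 0; i < w; i++) {
        for (j = 0; j < h; j++) {
            block[(size_t)j * row_stride + i] = 0;
        }
    }

    free(block);
    return 0;
}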

paddy
  • I tried a contiguous memory block at some point; it gives me about the same numbers. Passing the array size and freeing memory don't sound essential at this point, unless I'm missing something. – none Jan 31 '13 at 22:07
  • Fair enough, maybe I'm oldschool. From the days when you could leak memory outside your process. – paddy Jan 31 '13 at 22:24