
I have a simple program as follows. When I compiled the code without any optimization, it took 5.986s (user 3.677s, sys 1.716s) to run on a Mac with a 2.4 GHz i5 processor and 16 GB of DDR3-1600 CL9 memory. I am trying to figure out how many L1 cache misses happen in this program. Any suggestions? Thanks!

void main()
{
    int size = 1024 * 1024 * 1024;
    int * a = new int[size];

    int i;
    for (i = 0; i < size; i++) a[i] = i;
    delete[] a;
}
yuyang
  • There is no `new` in C; use `malloc` or the c++ tag, and `void main()` is very out of date – David Ranieri Mar 15 '14 at 08:03
  • You're just *writing* to memory. The only cache misses you should care about happen when *reading* from memory. – CAFxX Mar 15 '14 at 08:56
  • @CAFxX, no, writing to memory also involves cache lookups (except for streaming writes) and consumes bandwidth - it doesn't create a data dependency for the program execution, but once it becomes the bottleneck it would create backpressure and stall the program too. – Leeor Mar 15 '14 at 10:07
  • @Leeor I don't see anything in the question that makes me think that the compiler can't use streaming writes in this case. Hence my comment above. – CAFxX Mar 16 '14 at 00:47
  • @CAFxX, I doubt his compiler would use it unless asked to, and besides - the backpressure may also occur with streaming stores due to memory bandwidth congestion. Either way, your comment is very general, and as such is quite misleading. – Leeor Mar 16 '14 at 00:58
  • @Leeor, since the post is tagged "performance" I *highly* doubt that the OP is compiling with -O0. Also, the OP is asking about *cache misses* not about bandwidth issues or backpressure. If we assume that an optimizing compiler is doing its job properly (i.e. inserting non-temporal stores) we should agree on the fact that there would be *no cache misses*: hence my first comment is correct *and* relevant (as well as very general, as you correctly pointed out). – CAFxX Mar 16 '14 at 08:51
  • @CAFxX, -O3 doesn't make the code above produce streaming stores on my gcc 4.8.1 compiler, and I highly doubt you'll find many that do as it changes the semantics too drastically. – Leeor Mar 16 '14 at 09:22
  • @Leeor really? Clang at -O2 turns the above into a `int main() { return 0; }` (yes, it doesn't even call `new`) so I guess that the semantics of C++ allow what I'm suggesting; it's simply that GCC doesn't implement that optimization. – CAFxX Mar 17 '14 at 11:03

4 Answers


You can measure the number of cache misses using the cachegrind tool of valgrind. This page provides a pretty detailed summary.
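
A typical invocation looks like the following (a sketch; `cache_test.cpp` is a hypothetical file name for the OP's program, and the cache sizes are auto-detected unless overridden):

```shell
# Build with debug info so cachegrind can attribute misses to source lines
g++ -O0 -g cache_test.cpp -o cache_test

# Run under cachegrind; D1 = L1 data cache, LLd = last-level data cache
valgrind --tool=cachegrind ./cache_test

# Annotate the per-line counts from the output file the run produces
cg_annotate cachegrind.out.<pid>
```

The summary printed at the end reports `D1 miss rate` directly, which answers the L1 question without any hardware counters.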

Note: if you're using C then you should be using `malloc`, and don't forget the matching `free`, otherwise your program will leak memory. If you are using C++ (and this question is then incorrectly tagged), you should be using `new` and `delete`.

Liam M

If you want really fine-grained measurements of cache misses, you should use Intel's architectural performance counters, which can be accessed from user space using the `rdpmc` instruction. The kernel module source I wrote in this answer will enable `rdpmc` in user space for older CPUs.

Here is another kernel module to enable configuration of the counters for measuring last-level cache misses and last-level cache references. Note that I have hardcoded 8 cores, because that is what I happened to have in my configuration.

#include <linux/module.h>   /* Needed by all modules */
#include <linux/kernel.h>   /* Needed for KERN_INFO */

#define PERFEVTSELx_MSR_BASE   0x00000186
#define PMCx_MSR_BASE          0x000000c1      /* NB: write when evt disabled*/

#define PERFEVTSELx_USR        (1U << 16)      /* count in rings 1, 2, or 3 */
#define PERFEVTSELx_OS         (1U << 17)      /* count in ring 0 */
#define PERFEVTSELx_EN         (1U << 22)      /* enable counter */

static void
write_msr(uint32_t msr, uint64_t val)
{
    uint32_t lo = val & 0xffffffff;
    uint32_t hi = val >> 32;
    __asm __volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}

static uint64_t
read_msr(uint32_t msr)
{
    uint32_t hi, lo;
    __asm __volatile("rdmsr" : "=d" (hi), "=a" (lo) : "c" (msr));
    return ((uint64_t) lo) | (((uint64_t) hi) << 32);
}

static uint64_t old_value_perfsel0[8];
static uint64_t old_value_perfsel1[8];

/* Module parameter: load with clear=1 to reset the counters
 * instead of programming them. */
static int clear = 0;
module_param(clear, int, 0);

static DEFINE_SPINLOCK(mr_lock);
static unsigned long flags;

static void wrapper(void* ptr) {
    int id;
    uint64_t value;

    spin_lock_irqsave(&mr_lock, flags);

    id = smp_processor_id();

    // Save the old values before we do something stupid.
    old_value_perfsel0[id] = read_msr(PERFEVTSELx_MSR_BASE);
    old_value_perfsel1[id] = read_msr(PERFEVTSELx_MSR_BASE+1);


    // Clear out the existing counters
    write_msr(PERFEVTSELx_MSR_BASE, 0);
    write_msr(PERFEVTSELx_MSR_BASE + 1, 0);
    write_msr(PMCx_MSR_BASE, 0);
    write_msr(PMCx_MSR_BASE + 1, 0);
    if (clear){
        spin_unlock_irqrestore(&mr_lock, flags);
        return;
    }

    // Table 19-1 in the most recent Intel Manual - Architectural 
    // Last Level Cache References  Event select 2EH, Umask 4FH
    value = 0x2E | (0x4F << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE, value);

    // Table 19-1 in the most recent Intel Manual - Architectural 
    // Last Level Cache Misses Event select 2EH, Umask 41H
    value = 0x2E | (0x41 << 8) |PERFEVTSELx_EN |PERFEVTSELx_OS|PERFEVTSELx_USR;
    write_msr(PERFEVTSELx_MSR_BASE + 1, value);


    spin_unlock_irqrestore(&mr_lock, flags);
}

static void restore_wrapper(void* ptr) {
    int id = smp_processor_id();
    if (clear) return;
    write_msr(PERFEVTSELx_MSR_BASE, old_value_perfsel0[id]);
    write_msr(PERFEVTSELx_MSR_BASE+1, old_value_perfsel1[id]);
}

int init_module(void)
{
    printk(KERN_INFO "Entering write-msr!\n");
    on_each_cpu(wrapper, NULL, 0);
    /* 
     * A non 0 return means init_module failed; module can't be loaded. 
     */
    return 0;
}

void cleanup_module(void)
{
    on_each_cpu(restore_wrapper, NULL, 0);
    printk(KERN_INFO "Exiting write-msr!\n");
}

Here is a user-space wrapper around rdpmc.

uint64_t
read_pmc(int ecx)
{
    unsigned int a, d;
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((uint64_t)a) | (((uint64_t)d) << 32);
}
merlin2011

You have to be running on a 64-bit system, since the array alone occupies 4 GB. You are writing 4 GB of data (the code stores `i` into each element). The number of cache misses is 4 x 1024 x 1024 x 1024 bytes divided by the cache-line size, i.e. about 64 million for 64-byte lines. However, since all the memory accesses are sequential, you will not have many TLB misses, and the hardware prefetcher will most likely hide much of the latency for sequential cache lines.

gnasher729

Your performance here (or lack thereof) is totally dominated by paging: every new page (probably 4 KB in your case) causes a page fault, since it is newly allocated and was never used, and triggers an expensive OS flow. Cachegrind and performance monitors should show you the same behavior, so you may be confused if you expected just simple data accesses.

One way to avoid this is to allocate, store once to the entire array (or even once per page) to warm up the page tables, and then measure time internally in your application (using `rdtsc` or any C API you like) over the main loop.
Alternatively, if you want to use external time measurements, just loop multiple times (> 1000) and amortize, so the initial penalty becomes less significant.

Once you do all that, the cache misses you measure should reflect the number of accesses for each new 64-byte line (the 4 GiB array spans ~64M lines), plus the page walks (~1M pages, assuming they're 4 KB, multiplied by the number of page-table levels, since each walk has to look up memory at each level).
Under a virtualized platform the paging cost would roughly square (e.g. 9 accesses instead of 3), since each level of the guest page table also needs translation on the host.

Leeor
  • My machine has 16 GB of memory, and the program only allocates an array that occupies 4 GB, hence I don't think there are any page faults due to insufficient physical memory. – yuyang Mar 15 '14 at 17:47
  • @yuyang, I didn't say anything about capacity misses; memory allocation is often optimized lazily by the OS - don't expect malloc'd memory to always sit there waiting for you in memory and the page tables. See also - http://stackoverflow.com/questions/911860/does-malloc-lazily-create-the-backing-pages-for-an-allocation-on-linux-and-othe – Leeor Mar 15 '14 at 19:41