I don't understand cache miss count between cachegrind vs. perf tool

Question

I am studying cache effects using a simple micro-benchmark.

I expect that if the array is larger than the cache, the first read of every cache line will miss.

On my machine the cache line size is 64 bytes and each double is 8 bytes, so I expect N/8 misses in total, and cachegrind shows exactly that.

However, the perf tool reports a different result: only 34,256 L1 data cache load misses.

I suspected hardware prefetching, so I turned it off in the BIOS, but the result is the same.

I really don't understand why perf reports so many fewer cache misses than cachegrind. Could someone give me a reasonable explanation?

1. Here is a simple micro-benchmark program.

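    #include <stdio.h>
    #define N 10000000

    /* 10,000,000 doubles = 80,000,000 bytes, far larger than the 8 MiB LL cache */
    double A[N];

    int main(){
        int i;
        double temp=0.0;

        /* Read every element once, in order */
        for (i=0 ; i<N ; i++){
            temp = A[i]*A[i];
        }

        return 0;
    }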

2. The following is cachegrind's output:

    ==27612== Cachegrind, a cache and branch-prediction profiler
    ==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
    ==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
    ==27612== Command: ./test
    ==27612== 
    --27612-- warning: L3 cache found, using its data for the LL simulation.
    ==27612== 
    ==27612== I   refs:      110,102,998
    ==27612== I1  misses:            728
    ==27612== LLi misses:            720
    ==27612== I1  miss rate:        0.00%
    ==27612== LLi miss rate:        0.00%
    ==27612== 
    ==27612== D   refs:       70,038,455  (60,026,965 rd   + 10,011,490 wr)
    ==27612== D1  misses:      1,251,802  ( 1,251,288 rd   +        514 wr)
    ==27612== LLd misses:      1,251,624  ( 1,251,137 rd   +        487 wr)
    ==27612== D1  miss rate:         1.7% (       2.0%     +        0.0%  )
    ==27612== LLd miss rate:         1.7% (       2.0%     +        0.0%  )
    ==27612== 
    ==27612== LL refs:         1,252,530  ( 1,252,016 rd   +        514 wr)
    ==27612== LL misses:       1,252,344  ( 1,251,857 rd   +        487 wr)
    ==27612== LL miss rate:          0.6% (       0.7%     +        0.0%  )

    Generate a report File
    --------------------------------------------------------------------------------
    I1 cache:         32768 B, 64 B, 4-way associative
    D1 cache:         32768 B, 64 B, 8-way associative
    LL cache:         8388608 B, 64 B, 16-way associative
    Command:          ./test
    Data file:        cache_block
    Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
    Thresholds:       0.1 100 100 100 100 100 100 100 100
    Include dirs:     
    User annotated:   /home/jin/1_dev/99_test/OI/test.s
    Auto-annotation:  off

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw 
--------------------------------------------------------------------------------
110,102,998  728  720 60,026,965 1,251,288 1,251,137 10,011,490  514  487  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw          file:function
--------------------------------------------------------------------------------
110,000,011    1    1 60,000,003 1,250,000 1,250,000 10,000,003    0    0 /home/jin/1_dev/99_test/OI/test.s:main

--------------------------------------------------------------------------------
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s
--------------------------------------------------------------------------------
        Ir I1mr ILmr         Dr      D1mr      DLmr         Dw D1mw DLmw 

-- line 2 ----------------------------------------
         .    .    .          .         .         .          .    .    .            .comm   A,80000000,32
         .    .    .          .         .         .          .    .    .    .comm   B,80000000,32
         .    .    .          .         .         .          .    .    .    .text
         .    .    .          .         .         .          .    .    .    .globl   main
         .    .    .          .         .         .          .    .    .    .type   main, @function
         .    .    .          .         .         .          .    .    .  main:
         .    .    .          .         .         .          .    .    .  .LFB0:
         .    .    .          .         .         .          .    .    .    .cfi_startproc
         1    0    0          0         0         0          1    0    0    pushq   %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa_offset 16
         .    .    .          .         .         .          .    .    .    .cfi_offset 6, -16
         1    0    0          0         0         0          0    0    0    movq    %rsp, %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa_register 6
         1    0    0          0         0         0          0    0    0    movl    $0, %eax
         1    1    1          0         0         0          1    0    0    movq    %rax, -16(%rbp)
         1    0    0          0         0         0          1    0    0    movl    $0, -4(%rbp)
         1    0    0          0         0         0          0    0    0    jmp .L2
         .    .    .          .         .         .          .    .    .  .L3:
10,000,000    0    0 10,000,000         0         0          0    0    0    movl    -4(%rbp), %eax
10,000,000    0    0          0         0         0          0    0    0    cltq
10,000,000    0    0 10,000,000 1,250,000 1,250,000          0    0    0    movsd   A(,%rax,8), %xmm1 
10,000,000    0    0 10,000,000         0         0          0    0    0    movl    -4(%rbp), %eax
10,000,000    0    0          0         0         0          0    0    0    cltq
10,000,000    0    0 10,000,000         0         0          0    0    0    movsd   A(,%rax,8), %xmm0
10,000,000    0    0          0         0         0          0    0    0    mulsd   %xmm1, %xmm0
10,000,000    0    0          0         0         0 10,000,000    0    0    movsd   %xmm0, -16(%rbp)
10,000,000    0    0 10,000,000         0         0          0    0    0    addl    $1, -4(%rbp)
         .    .    .          .         .         .          .    .    .  .L2:
10,000,001    0    0 10,000,001         0         0          0    0    0    cmpl    $9999999, -4(%rbp)
10,000,001    0    0          0         0         0          0    0    0    jle .L3
         1    0    0          0         0         0          0    0    0    movl    $0, %eax
         1    0    0          1         0         0          0    0    0    popq    %rbp
         .    .    .          .         .         .          .    .    .    .cfi_def_cfa 7, 8
         1    0    0          1         0         0          0    0    0    ret
         .    .    .          .         .         .          .    .    .    .cfi_endproc
         .    .    .          .         .         .          .    .    .  .LFE0:
         .    .    .          .         .         .          .    .    .    .size   main, .-main
         .    .    .          .         .         .          .    .    .    .ident  "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
         .    .    .          .         .         .          .    .    .    .section    .note.GNU-stack,"",@progbits

--------------------------------------------------------------------------------
 Ir I1mr ILmr  Dr D1mr DLmr  Dw D1mw DLmw 
--------------------------------------------------------------------------------
100    0    0 100  100  100 100    0    0  percentage of events annotated

3. The following is perf's output:

$ sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test

 Performance counter stats for './test' (10 runs):

   113,898,951 instructions              #    0.00  insns per cycle          ( +- 12.73% ) [17.36%]
        53,607 cache-references                                              ( +- 12.92% ) [29.23%]
         1,483 cache-misses              #    2.767 % of all cache refs      ( +- 26.66% ) [39.84%]
    48,612,823 L1-dcache-loads                                               ( +-  4.58% ) [50.45%]
        34,256 L1-dcache-load-misses     #    0.07% of all L1-dcache hits    ( +- 18.94% ) [54.38%]
    14,992,686 L1-dcache-stores                                              ( +-  4.90% ) [52.58%]
         1,980 L1-dcache-store-misses                                        ( +-  6.36% ) [61.83%]
         1,154 LLC-loads                                                     ( +- 61.14% ) [53.22%]
            18 LLC-load-misses           #    1.60% of all LL-cache hits     ( +- 16.26% ) [10.87%]
             0 LLC-prefetches                                               [ 0.00%]

   0.037949840 seconds time elapsed                                          ( +-  3.57% )

More experimental results (2014.05.13):

jin@desktop:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test

 Performance counter stats for './test' (10 runs):

   116,464,390 instructions              #    0.00  insns per cycle          ( +-  2.67% ) [67.43%]
         5,994 r53024e  <-- L1D hardware prefetch misses                     ( +- 21.74% ) [70.92%]
     1,387,214 r53014e  <-- L1D hardware prefetch requests                   ( +-  2.37% ) [75.61%]
    61,667,802 L1-dcache-loads                                               ( +-  1.27% ) [78.12%]
        26,297 L1-dcache-load-misses     #    0.04% of all L1-dcache hits    ( +- 48.92% ) [43.24%]
             0 r500f0a  <-- LLC lines allocated                                 [56.71%]
        41,545 r500109  <-- Number of LLC read misses                        ( +-  6.16% ) [50.08%]

   0.037080925 seconds time elapsed

In the result above, the number of L1D hardware prefetch requests (1,387,214) is close to the D1 miss count (1,250,000) reported by cachegrind.

My conclusion is that when memory is accessed in a streaming pattern, the L1D hardware prefetcher kicks in. As a result, I can't tell from the LLC miss information how many bytes are actually loaded from memory.

Is my conclusion correct?

Editor's notes:
(1) According to the output of cachegrind, the OP was most probably using gcc 4.6.3 with no optimizations.
(2) Some of the raw events used in perf stat are only officially supported on Nehalem/Westmere, so I think that's the microarchitecture the OP is using.
(3) The bits set in the most significant byte (i.e., the third byte) of the raw event codes are ignored by perf (although not all bits of the third byte are ignored), so the events are effectively r024e, r014e, r0f0a, and r0109.
(4) The events r0f0a and r0109 are uncore events, but the OP specified them as core events. This is wrong, because perf will then measure them as core events.
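
To make note (3) concrete, here is a small C sketch that splits a raw code such as r53024e into its bytes. It assumes the usual x86 layout, with the event select in the low byte and the unit mask in the next byte. The struct and function names are made up for illustration and are not part of perf.

```c
#include <stdint.h>

/* Bytes of a raw x86 event code as written after the "r" in
 * "perf stat -e rNNNN". Assumed layout: low byte = event select,
 * next byte = unit mask. Names are illustrative only. */
struct raw_event {
    uint8_t event_select;  /* bits 0-7 */
    uint8_t umask;         /* bits 8-15 */
    uint8_t upper;         /* bits 16-23, the "third byte" from note (3) */
};

struct raw_event decode_raw_event(uint32_t code)
{
    struct raw_event e;
    e.event_select = (uint8_t)(code & 0xff);
    e.umask        = (uint8_t)((code >> 8) & 0xff);
    e.upper        = (uint8_t)((code >> 16) & 0xff);
    return e;
}
```

For 0x53024e this gives event select 0x4e, unit mask 0x02, and third byte 0x53, which is why the note treats the event as effectively r024e.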

Could you add the command lines you used, please? gcc with its options, perf stat ... — amigadev, May 12 '14 at 13:49
Sorry for the late answer. The command line is as follows: #> sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test — libertyjin, May 13 '14 at 12:28
Are you sure your initial program isn't being optimized out by the compiler? Specifically, the main loop can be bypassed since A[i] * A[i] isn't being saved between iterations (If you use a double array for temp, this should solve the problem). I suspect the compiler is optimizing out your microbenchmark. — , Jan 26 '15 at 16:48
I am having the same issue. Did you figure out the reason for this discrepancy? — chamibuddhika, Mar 14 '15 at 02:13

score 1 · Answer 1 · answered May 01 '15 at 07:20

Bottom line: your assumption regarding prefetches is correct, but your workaround isn't.

First, as Carlo pointed out, this loop would usually be optimized out by any compiler. Since both perf and cachegrind show that ~100M instructions do retire, I guess you compiled without optimizations, which means the behavior isn't very realistic. For example, the annotated assembly shows the loop variable being kept on the stack (-4(%rbp)) rather than in a register, which adds pointless memory accesses and skews the cache counters.
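
As a rough illustration of the optimization point, here is one way the loop could be written so that an optimizing compiler has a reason to keep the loads. This is a sketch and not the OP's code; the function name is made up. It accumulates into a sum and returns it instead of overwriting temp on every iteration.

```c
#include <stddef.h>

/* Sum of squares over a[0..n-1]. Every load feeds the returned value,
 * so the loads cannot be dropped as long as the caller uses the result. */
double sum_of_squares(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * a[i];
    return sum;
}
```

The caller still has to use the result, for example by printing sum_of_squares(A, N) with printf. If the return value is ignored, the compiler is again free to discard the whole loop.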

Now, the difference between your runs is that cachegrind is just a cache simulator. It doesn't simulate prefetches, so every first access to a line misses, as expected. The real CPU, on the other hand, does have hardware prefetchers, as you can see, so the first time each line is brought in from memory it is fetched by a prefetch (thanks to the simple streaming pattern) and not by an actual demand load. This is why perf doesn't count these accesses with the normal miss counters.

You can see that once you enable the prefetch counter, you get roughly the same N/8 prefetches (plus some additional ones, probably from other types of accesses).
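
The N/8 figure is just the array size divided by the line size. A tiny sketch of that arithmetic, with a made-up function name and assuming the array starts on a cache-line boundary:

```c
#include <stddef.h>

/* Number of distinct cache lines covered by n_elems contiguous elements,
 * assuming the first element sits on a line boundary. Without prefetching,
 * a sequential scan misses once per line. */
size_t lines_touched(size_t n_elems, size_t elem_size, size_t line_size)
{
    size_t bytes = n_elems * elem_size;
    return (bytes + line_size - 1) / line_size;  /* round up to whole lines */
}
```

For N = 10,000,000 doubles of 8 bytes and 64-byte lines this gives 1,250,000, which matches cachegrind's D1mr for main. The 1,387,214 L1D prefetch requests that perf reports are somewhat above that number.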

Disabling the prefetcher would seem like the right thing to do, but most CPUs don't offer much control over that. You didn't specify what processor you're using, but if it's Intel, for example, you can see here that only the L2 prefetchers are controlled by the BIOS, while your output shows L1 prefetches: https://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers

Search the manuals for your CPU type to see which L1 prefetchers exist, and understand how to work around them. Usually a simple stride (larger than a single cache line) should suffice to trick them, but if that doesn't work, you'll need to change your access pattern to be more random. You can generate a random permutation of the indices for that.
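
Here is a minimal sketch of the random-permutation idea, assuming a Fisher-Yates shuffle driven by a small xorshift generator. The function names and the generator choice are mine, picked only to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift64 generator; the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill idx[0..n-1] with a random permutation of 0..n-1 (Fisher-Yates). */
void fill_random_permutation(size_t *idx, size_t n, uint64_t seed)
{
    uint64_t state = seed ? seed : 0x9E3779B97F4A7C15ULL;  /* avoid zero state */
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)(xorshift64(&state) % i);  /* 0 <= j < i */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Visit a[] in the order given by idx[] instead of sequentially. */
double sum_squares_permuted(const double *a, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[idx[i]] * a[idx[i]];
    return sum;
}
```

Two caveats with this approach: the index array is itself read sequentially, so it adds its own (prefetchable) traffic, and with element-level shuffling a line can be evicted between visits, so the miss count can end up well above N/8.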

You can disable almost all of the prefetchers, including the two L1 prefetchers on most modern Intel chips easily at the command line. See [here](https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors). — BeeOnRope, Oct 27 '19 at 14:25
@BeeOnRope, Right. I think back when I wrote this the MSRs were supported, but not all BIOSes had that option so you'd have to do it manually (and have the privileges for it). Didn't want to complicate the answer — Leeor, Oct 27 '19 at 20:03
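
For readers who want to try the MSR route from the comments, here is a small C sketch of how the disable mask could be assembled. It assumes the layout described in the Intel disclosure linked above: MSR 0x1A4, with one disable bit per prefetcher in bits 0 to 3. The macro, enum, and function names are made up for illustration, and the layout should be checked against the documentation for your exact CPU model before anything is written.

```c
#include <stdint.h>

/* Assumed layout of MSR 0x1A4, taken from the Intel disclosure linked in
 * the comment above. Setting a bit DISABLES that prefetcher.
 * All names here are illustrative. */
#define PREFETCH_CONTROL_MSR 0x1A4u

enum prefetcher_bit {
    L2_HW_PREFETCHER  = 1u << 0,
    L2_ADJACENT_LINE  = 1u << 1,
    DCU_PREFETCHER    = 1u << 2,  /* L1D streaming prefetcher */
    DCU_IP_PREFETCHER = 1u << 3   /* L1D instruction-pointer-based prefetcher */
};

/* Build the value to write; each nonzero argument disables one prefetcher. */
uint64_t prefetch_disable_mask(int l2_hw, int l2_adjacent, int dcu, int dcu_ip)
{
    uint64_t mask = 0;
    if (l2_hw)       mask |= L2_HW_PREFETCHER;
    if (l2_adjacent) mask |= L2_ADJACENT_LINE;
    if (dcu)         mask |= DCU_PREFETCHER;
    if (dcu_ip)      mask |= DCU_IP_PREFETCHER;
    return mask;
}
```

Disabling all four gives 0xf, and disabling only the two L1 prefetchers gives 0xc. On Linux the value would typically be written with root privileges and the msr kernel module loaded, for example with the wrmsr tool from msr-tools.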
