I am studying cache effects using a simple micro-benchmark.
I think that if N is bigger than the cache size, the cache takes a miss on the first read of every cache line.
On my machine the cache line size is 64 bytes, so each line holds 8 doubles and I expect N/8 = 1,250,000 misses in total. Cachegrind shows exactly that.
However, the perf tool displays a different result: it reports only 34,256 L1-dcache-load-misses.
I suspected hardware prefetching, so I turned it off in the BIOS, but the result is the same.
I really don't know why perf reports so many fewer cache misses than cachegrind.
Could someone give me a reasonable explanation?
1. Here is a simple micro-benchmark program.
#include <stdio.h>
#define N 10000000
double A[N];
int main(){
int i;
double temp=0.0;
for (i=0 ; i<N ; i++){
temp = A[i]*A[i];
}
return 0;
}
2. The following is cachegrind's output:
==27612== Cachegrind, a cache and branch-prediction profiler
==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==27612== Command: ./test
==27612==
--27612-- warning: L3 cache found, using its data for the LL simulation.
==27612==
==27612== I refs: 110,102,998
==27612== I1 misses: 728
==27612== LLi misses: 720
==27612== I1 miss rate: 0.00%
==27612== LLi miss rate: 0.00%
==27612==
==27612== D refs: 70,038,455 (60,026,965 rd + 10,011,490 wr)
==27612== D1 misses: 1,251,802 ( 1,251,288 rd + 514 wr)
==27612== LLd misses: 1,251,624 ( 1,251,137 rd + 487 wr)
==27612== D1 miss rate: 1.7% ( 2.0% + 0.0% )
==27612== LLd miss rate: 1.7% ( 2.0% + 0.0% )
==27612==
==27612== LL refs: 1,252,530 ( 1,252,016 rd + 514 wr)
==27612== LL misses: 1,252,344 ( 1,251,857 rd + 487 wr)
==27612== LL miss rate: 0.6% ( 0.7% + 0.0% )
The generated report file:
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 4-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 8388608 B, 64 B, 16-way associative
Command: ./test
Data file: cache_block
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated: /home/jin/1_dev/99_test/OI/test.s
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
110,102,998 728 720 60,026,965 1,251,288 1,251,137 10,011,490 514 487 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
110,000,011 1 1 60,000,003 1,250,000 1,250,000 10,000,003 0 0 /home/jin/1_dev/99_test/OI/test.s:main
--------------------------------------------------------------------------------
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
-- line 2 ----------------------------------------
. . . . . . . . . .comm A,80000000,32
. . . . . . . . . .comm B,80000000,32
. . . . . . . . . .text
. . . . . . . . . .globl main
. . . . . . . . . .type main, @function
. . . . . . . . . main:
. . . . . . . . . .LFB0:
. . . . . . . . . .cfi_startproc
1 0 0 0 0 0 1 0 0 pushq %rbp
. . . . . . . . . .cfi_def_cfa_offset 16
. . . . . . . . . .cfi_offset 6, -16
1 0 0 0 0 0 0 0 0 movq %rsp, %rbp
. . . . . . . . . .cfi_def_cfa_register 6
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 1 1 0 0 0 1 0 0 movq %rax, -16(%rbp)
1 0 0 0 0 0 1 0 0 movl $0, -4(%rbp)
1 0 0 0 0 0 0 0 0 jmp .L2
. . . . . . . . . .L3:
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 1,250,000 1,250,000 0 0 0 movsd A(,%rax,8), %xmm1
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 0 0 0 0 0 movsd A(,%rax,8), %xmm0
10,000,000 0 0 0 0 0 0 0 0 mulsd %xmm1, %xmm0
10,000,000 0 0 0 0 0 10,000,000 0 0 movsd %xmm0, -16(%rbp)
10,000,000 0 0 10,000,000 0 0 0 0 0 addl $1, -4(%rbp)
. . . . . . . . . .L2:
10,000,001 0 0 10,000,001 0 0 0 0 0 cmpl $9999999, -4(%rbp)
10,000,001 0 0 0 0 0 0 0 0 jle .L3
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 0 0 1 0 0 0 0 0 popq %rbp
. . . . . . . . . .cfi_def_cfa 7, 8
1 0 0 1 0 0 0 0 0 ret
. . . . . . . . . .cfi_endproc
. . . . . . . . . .LFE0:
. . . . . . . . . .size main, .-main
. . . . . . . . . .ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
. . . . . . . . . .section .note.GNU-stack,"",@progbits
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
100 0 0 100 100 100 100 0 0 percentage of events annotated
3. The following is perf's output:
$ sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test
Performance counter stats for './test' (10 runs):
113,898,951 instructions # 0.00 insns per cycle ( +- 12.73% ) [17.36%]
53,607 cache-references ( +- 12.92% ) [29.23%]
1,483 cache-misses # 2.767 % of all cache refs ( +- 26.66% ) [39.84%]
48,612,823 L1-dcache-loads ( +- 4.58% ) [50.45%]
34,256 L1-dcache-load-misses # 0.07% of all L1-dcache hits ( +- 18.94% ) [54.38%]
14,992,686 L1-dcache-stores ( +- 4.90% ) [52.58%]
1,980 L1-dcache-store-misses ( +- 6.36% ) [61.83%]
1,154 LLC-loads ( +- 61.14% ) [53.22%]
18 LLC-load-misses # 1.60% of all LL-cache hits ( +- 16.26% ) [10.87%]
0 LLC-prefetches [ 0.00%]
0.037949840 seconds time elapsed ( +- 3.57% )
More experimental results (2014.05.13):
jin@desktop:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test
Performance counter stats for './test' (10 runs):
116,464,390 instructions # 0.00 insns per cycle ( +- 2.67% ) [67.43%]
5,994 r53024e <-- L1D hardware prefetch misses ( +- 21.74% ) [70.92%]
1,387,214 r53014e <-- L1D hardware prefetch requests ( +- 2.37% ) [75.61%]
61,667,802 L1-dcache-loads ( +- 1.27% ) [78.12%]
26,297 L1-dcache-load-misses # 0.04% of all L1-dcache hits ( +- 48.92% ) [43.24%]
0 r500f0a <-- LLC lines allocated [56.71%]
41,545 r500109 <-- Number of LLC read misses ( +- 6.16% ) [50.08%]
0.037080925 seconds time elapsed
In the above result, the number of "L1D hardware prefetch requests" (1,387,214) is close to the D1 miss count (1,250,000) reported by cachegrind.
My conclusion is that when memory is accessed in a streaming pattern, the L1D prefetcher kicks in, and that I cannot tell from the LLC miss information how many bytes are loaded from memory.
Is my conclusion correct?
Editor's notes:
(1) According to the output of cachegrind, the OP was most probably using gcc 4.6.3 with no optimizations.
(2) Some of the raw events used in perf stat are only officially supported on Nehalem/Westmere, so I think that's the microarchitecture the OP is using.
(3) The bits set in the most significant byte (i.e., the third byte) of the raw event codes are ignored by perf. (Not all bits of the third byte are ignored, though.) So the events effectively are r024e, r014e, r0f0a, and r0109.
(4) The events r0f0a and r0109 are uncore events, but the OP has specified them as core events, which is wrong because perf will measure them as core events.