I tried to count the number of instructions of add loop application in RISC-V FPGA, using very simple RV32IM core with Linux 5.4.0 buildroot.
add.c:
int main()
{
int a = 0;
for (int i = 0; i < 1024*1024; i++)
a++;
printf("RESULT: %d\n", a);
return a;
}
I used -O0 compile option so that the loop really loop, and the resulting dump file is following:
000103c8 <main>:
103c8: fe010113 addi sp,sp,-32
103cc: 00812e23 sw s0,28(sp)
103d0: 02010413 addi s0,sp,32
103d4: fe042623 sw zero,-20(s0)
103d8: fe042423 sw zero,-24(s0)
103dc: 01c0006f j 103f8 <main+0x30>
103e0: fec42783 lw a5,-20(s0)
103e4: 00178793 addi a5,a5,1 # 12001 <__TMC_END__+0x1>
103e8: fef42623 sw a5,-20(s0)
103ec: fe842783 lw a5,-24(s0)
103f0: 00178793 addi a5,a5,1
103f4: fef42423 sw a5,-24(s0)
103f8: fe842703 lw a4,-24(s0)
103fc: 001007b7 lui a5,0x100
10400: fef740e3 blt a4,a5,103e0 <main+0x18>
10404: fec42783 lw a5,-20(s0)
10408: 00078513 mv a0,a5
1040c: 01c12403 lw s0,28(sp)
10410: 02010113 addi sp,sp,32
10414: 00008067 ret
As you can see, the application loops from 103e0 ~ 10400, which is 9 instructions, so the number of total instruction must be at least 9 * 1024^2 But the result of perf stat is pretty weird
RESULT: 1048576
Performance counter stats for './add.out':
3170.45 msec task-clock # 0.841 CPUs utilized
20 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
38 page-faults # 0.012 K/sec
156192046 cycles # 0.049 GHz (11.17%)
8482441 instructions # 0.05 insn per cycle (11.12%)
1145775 branches # 0.361 M/sec (11.25%)
3.771031341 seconds time elapsed
0.075933000 seconds user
3.559385000 seconds sys
The total number of instructions perf counted was lower than 9 * 1024^2. Difference is about 10%. How is this happening? I think the output of perf should be larger than that, because perf tool measures not only overall add.out, but also overhead of perf itself and context-switching.