The following is output from the Linux perf profiler on C++ code compiled with gcc. I am calculating (a[i]+b[i])^c[i] in a loop that counts i down from n and exits when i reaches -1. This is by far the hottest loop in my program, which can run for hours or days.
If I am understanding this output correctly, perf is telling me that 57% of the time in this function is spent on subtracting 8 from the rdx register. That seems unlikely, seeing as subtracting 1 from the rcx register three lines above is only taking 0.99% of the time. I think I must be missing something. What is the explanation for these numbers? Is the time for the previous instructions somehow unfairly getting charged to the subtraction?
3.64 : 484388: mov 0x0(%rbp,%rdx,1),%rax
0.64 : 48438d: add (%rbx,%rdx,1),%rax
0.99 : 484391: sub $0x1,%rcx
3.60 : 484395: xor (%rdi,%rdx,1),%rax
57.13 : 484399: sub $0x8,%rdx
0.22 : 48439d: or %rax,%rsi
4.23 : 4843a0: cmp $0xffffffffffffffff,%rcx
0.00 : 4843a4: jne 484388
I got these numbers by running "perf record ./myprogram", then "perf report" in the same directory, and then browsing to this piece of assembly.
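For reference, reading the disassembly back into source, the loop appears to be roughly equivalent to the sketch below. This is a reconstruction, not the actual code: the function and variable names are hypothetical, the 64-bit unsigned element type is an assumption based on the 8-byte stride, and the OR into an accumulator is inferred from the "or %rax,%rsi" instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction of the hot loop; all names are made up.
// Indices n down to 0 are read, so each array must hold n + 1 elements.
std::uint64_t combine(const std::uint64_t* a, const std::uint64_t* b,
                      const std::uint64_t* c, std::ptrdiff_t n) {
    std::uint64_t acc = 0;                      // accumulator (%rsi in the listing)
    for (std::ptrdiff_t i = n; i != -1; --i) {  // counter in %rcx, byte offset in %rdx
        acc |= (a[i] + b[i]) ^ c[i];            // mov + add, then xor, then or
    }
    return acc;
}
```

Each iteration performs three loads (one each from a, b and c) and a handful of register-only arithmetic instructions, which matches the eight instructions in the listing.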