
The following is output from the Linux perf profiler on C++ code compiled by gcc. I am calculating (a[i]+b[i])^c[i] in a loop going from i=n downwards until the loop exits at i=-1. This is by far the hottest loop in my program, which can run for hours or days.

If I am understanding this output correctly, perf is telling me that 57% of the time in this function is spent on subtracting 8 from the rdx register. That seems unlikely, seeing as subtracting 1 from the rcx register three lines above is only taking 0.99% of the time. I think I must be missing something. What is the explanation for these numbers? Is the time for the previous instructions somehow unfairly getting charged to the subtraction?

    3.64 :          484388:       mov    0x0(%rbp,%rdx,1),%rax
    0.64 :          48438d:       add    (%rbx,%rdx,1),%rax
    0.99 :          484391:       sub    $0x1,%rcx
    3.60 :          484395:       xor    (%rdi,%rdx,1),%rax
   57.13 :          484399:       sub    $0x8,%rdx
    0.22 :          48439d:       or     %rax,%rsi
    4.23 :          4843a0:       cmp    $0xffffffffffffffff,%rcx
    0.00 :          4843a4:       jne    484388

I got these numbers by doing "perf record ./myprogram", then "perf report" in the same directory and then I browsed to this piece of assembly.
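For context, here is a minimal C++ sketch of a loop that would plausibly compile to assembly like the listing above. The original source is not shown, so the function name, the 64-bit unsigned element type (suggested by the 8-byte stride), and the OR-accumulation into a single result (suggested by `or %rax,%rsi`) are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reconstruction, not the original source.
// ORs together (a[i] + b[i]) ^ c[i] for i = n down to 0;
// the loop exits once i reaches -1, as described in the question.
std::uint64_t orOfXoredSums(const std::uint64_t* a,
                            const std::uint64_t* b,
                            const std::uint64_t* c,
                            std::ptrdiff_t n) {
    std::uint64_t acc = 0;
    for (std::ptrdiff_t i = n; i != -1; --i) {
        acc |= (a[i] + b[i]) ^ c[i];  // three loads per iteration
    }
    return acc;
}
```

Each iteration performs three memory loads (the `mov`, `add`, and `xor` operands) and two cheap register decrements, which is why the 57% figure on a register-only `sub` looks suspicious.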

Bjarke H. Roune
  • Try moving the `sub` down two instructions? Also, GCC's vector extensions will let you take advantage of SSE2/3/etc while being vaguely portable and not too difficult to use. – tc. Oct 26 '12 at 20:19
  • FWIW, if you run `perf record` without specifying events (e.g. via `-e cycles,cache-misses,branch-misses`), it'll record `cycles`. The exact meaning of that is described here: https://perf.wiki.kernel.org/index.php/Tutorial#Default_event:_cycle_counting – oberstet Nov 03 '13 at 22:23

1 Answer


I found this on the perf wiki:

Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e., where it was at the end of the sampling period. In some case, the distance between those two points may be several dozen instructions or more if there were taken branches. When the program cannot make forward progress, those two locations are indeed identical. For this reason, care must be taken when interpreting profiles.

That may be the explanation. Unfortunately, the wiki does not say how to figure out if this is indeed the problem or how to correct for this issue.
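To make the quoted description concrete, here is a toy model of skid in C++. It is only an illustration under a strong simplifying assumption: every sample is reported a fixed number of instructions after the one that actually incurred the cost, wrapping around the loop body. Real skid varies from sample to sample, and the function name `attributeWithSkid` and its inputs are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Toy model only (assumption: constant skid distance).
// trueCost[i] is the cost actually incurred by instruction i of a loop body.
// The returned vector is what a profiler would report if every sample landed
// `skid` instructions later, wrapping around at the end of the loop.
std::vector<double> attributeWithSkid(const std::vector<double>& trueCost,
                                      std::size_t skid) {
    std::vector<double> reported(trueCost.size(), 0.0);
    for (std::size_t i = 0; i < trueCost.size(); ++i) {
        reported[(i + skid) % trueCost.size()] += trueCost[i];
    }
    return reported;
}
```

If, hypothetically, most of the cost belonged to a memory-reading instruction such as the `xor` and the skid were one instruction, this model would report that cost on the following `sub $0x8,%rdx`. The model shows only the direction of the misattribution; it does not establish that this is what happens in the profile above.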

Bjarke H. Roune
  • I'd also be interested in an explanation. The only other explanation besides "skid" is hinted at here: http://stackoverflow.com/questions/17010178/instruction-level-profiling-the-meaning-of-the-instruction-pointer – oberstet Nov 03 '13 at 22:21