How does perf record (or other profilers) pick which instruction to count as costing time?

Question

Recently, I found out that actually perf (or pprof) may show in disassembly view instruction timing near the line that didn't actually take this time. The real instruction, which actually took this time, is before it. I know a vague explanation that this happens due to instruction pipelining in CPU. However, I would like to find out the following:

Is there a more detailed explanation of this effect?
Is it documented in perf or pprof? I haven't found any references.
Is there a way to obtain correctly placed timings?

I think problem do not directly comes from perf but the CPU. More fundamentally, this is related to how we should measure the time. Indeed, modern CPUs can execute the instruction in an out-of-order way and *in parallel*. What should we do if an instruction like a `div` is very slow but can be fully overlapped with other instructions? Should we report the `div` as being slow or free? What should we do if an instruction is heavily delayed because it can only be executed after another one on the critical path? And what if this instruction include a delayed memory access? — Jérôme Richard, Sep 27 '21 at 18:07
Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads. For an example with Intel x86 CPUs, see [Why is this jump instruction so expensive when performing pointer chasing?](https://stackoverflow.com/a/65708143), which also apparently depends on the effect of letting the last instruction in the ROB retire when an interrupt is raised. IDK if other CPU designs would have any significant differences, e.g. an ARM or AMD, and you didn't tag [tag:intel-pmu] or even mention x86. — Peter Cordes, Sep 27 '21 at 18:17

Peter Cordes · Accepted Answer · 2022-06-02T05:59:37.297

(quick not super detailed answer; a more detailed one would be good if someone wants to write one).

perf just uses the CPU's own hardware performance counters, which can be put into a mode where they record an event when the counter counts down to zero or up to a threshold.

Either raising an interrupt or writing an event into a buffer in memory (with PEBS precise events). That event will include a code address that the CPU picked to associate with the event (i.e. the point at which the interrupt was raised), even for events like cycles which unlike instructions don't inherently have a specific instruction associated. The out-of-order exec back-end can have a couple hundred instructions in flight when counter wraps, but has to pick exactly one for any given sample.

Generally the CPU "blames" the instruction that was waiting for a slow-to-produce result, not the one producing it, especially cache-miss loads.

For an example with Intel x86 CPUs, see Why is this jump instruction so expensive when performing pointer chasing? which also appears to depend on the effect of letting the last instruction in the ROB retire when an interrupt is raised. (Intel CPUs at least do seem to do that; makes sense for ensuring forward progress even with a potentially slow instruction.)

In general there can be "skew" when a later instruction is blamed than the one actually taking the time, possibly with different causes. (Perhaps especially for uncore events, since they happen asynchronously to the core clock.)

Other related Q&As with interesting examples or other things

Inconsistent `perf annotate` memory load/store time reporting
Linux perf reporting cache misses for unexpected instruction
https://travisdowns.github.io/blog/2019/08/20/interrupts.html - some experiments into which instructions tend to get counts on Skylake.

Travis Down's has a nice investigation of which instruction will be picked [here](https://travisdowns.github.io/blog/2019/08/20/interrupts.html) — Noah, Sep 27 '21 at 23:55
Any links to documentations on setting up interrupts after event thresholds? — Noah, Sep 28 '21 at 00:11
@Noah: What do you mean "setting up interrupts"? You mean how to write a kernel with an interrupt handler that's called when a PMU event fires, or to collect a PEBS buffer full of events? There are some light-weight alternatives to PAPI that have their own code for that, as kernel modules. Like libpfm. — Peter Cordes, Sep 28 '21 at 00:33
yeah, I guess where in the kernel this logic is. Havent found it yet. — Noah, Sep 28 '21 at 00:35
@Noah: If you just want to reproduce Travis's results (recording which instruction was interrupted) for PMU-triggered interrupts, `perf record -e cycles` or maybe `ref_cycles` will basically do that, and you can configure the thresholds manually with perf options; you don't need to install your own interrupt handler. But for curiosity, sorry IDK where to look exactly. — Peter Cordes, Sep 28 '21 at 00:37

How does perf record (or other profilers) pick which instruction to count as costing time?

1 Answers1

Linked