How to use linux `perf` tool to generate "Off-CPU" profile

Question

Brendan D. Gregg (author of DTrace book) has interesting variant of profiling: the "Off-CPU" profiling (and Off-CPU Flame Graph; slides 2013, p112-137) to see, where the thread or application were blocked (was not executed by CPU, but waiting for I/O, pagefault handler, or descheduled due short of CPU resources):

This time reveals which code-paths are blocked and waiting while off-CPU, and for how long exactly. This differs from traditional profiling which often samples the activity of threads at a given interval, and (usually) only examine threads if they are executing work on-CPU.

He also can combine Off-CPU profile data and On-CPU profile together: http://www.brendangregg.com/FlameGraphs/hotcoldflamegraphs.html

The examples given by Gregg are made using dtrace, which is not usually available in Linux OS. But there are some similar tools (ktap, systemtap, perf) and the perf as I think has widest installed base. Usually perf generated On-CPU profiles (which functions were executed more often on CPU).

How can I translate Gregg's Off-CPU examples to perf profiling tool in Linux?

PS: There is link to systemtap variant of Off-CPU flamegraphs in the slides from LISA13, p124: "Yichun Zhang created these, and has been using them on Linux with SystemTap to collect the proﬁle data. See: • http://agentzh.org/misc/slides/off-cpu-flame-graphs.pdf"" (CloudFlare Beer Meeting on 23 August 2013)

Related question - profiling sleep times with perf: http://stackoverflow.com/questions/21756067/profiling-sleep-times-with-perf — osgx, May 17 '14 at 16:41

score 15 · Accepted Answer · answered Jun 27 '16 at 20:42

The perf technique I published[1] was a high-overhead workaround, until perf has BPF support for doing this.

Right now, the lowest cost way of generating an off-CPU flame graph on Linux is on a 4.6+ kernel (which has BPF stack trace support), and with bcc/BPF. I wrote a tool for it, offcputime[2], which can be run with a -f option for "folded output", suitable for feeding into flamegraph.pl. This offcputime tool does the timing and stack counting all in kernel content, and dumps a report that is then printed with symbols.

One day, I expect that perf itself will be able to do this as well: run a BPF program that does the in-kernel counting, and dumping of a report.

In the meantime, we can use bcc/BPF. If for some reason you can't use bcc, you can, right now, take that offcputime program and write it in C. A more complicated version is available in the Linux source, as samples/bpf/offwaketime*. With the new BPF features on Linux, if there's a will, there's a way.

[1] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html

[2] https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt

score 5 · Answer 2 · edited Jun 20 '20 at 09:12

5

Brendan Gregg published instruction about Off-cpu flame graph generating: http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html and https://github.com/brendangregg/FlameGraph/issues/47#

Off-CPU time flame graphs may solve (say) 60% of the issues, with the remainder requiring walking the thread wakeups to find root cause. I explained off-CPU time flame graphs, this wakeup issue, and additional work, in my LISA13 talk on flame graphs (slides, youtube).

Here I'll show one way to do off-CPU time flame graphs using Linux perf_events.

# perf record -e sched:sched_stat_sleep -e sched:sched_switch \
 -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
# perf inject -v -s -i perf.data.raw -o perf.data
# perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
./stackcollapse.pl | \
./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

stackcollapse.pl and flamegraph.pl from Gregg are used to draw flamegraph.

There are perf options used from 3.17 kernels and newer...

edited Jun 20 '20 at 09:12

Community

1
1

answered Feb 28 '15 at 17:18

osgx

90,338
53
357
513

Don't get sucked in by pretty pixels, enthusiastic hand-waving, or new buzzwords like on- and off-CPU. If you're not just "analyzing performance", but rather actively seeking maximum performance, you have to find the "bottlenecks" that are trying to hide from you, and they can easily hide in flame graphs. Check [*here*](http://stackoverflow.com/a/27867426/23771). – Mike Dunlavey Mar 01 '15 at 14:19
Mike, is there many modern profilers that can give profile "derived from wall-time samples"? Do you think that random stack sampling is modern (or THE ONLY RIGHT) profiling technology? – osgx Mar 01 '15 at 19:21
2

What I'm saying is 1) If a speedup would save fraction X of time, the average number of samples needed to expose it twice is 2/X, not thousands. (Gregg gave an example of 2000x, saving X=0.9995 of the time - *two or even one* sample would have nailed it.) 2) If thousands are taken and summarized (as in a flame graph or any other summary) the insight is lost that tells you precisely what the speedup is, and you *can't afford to miss any*. 3) There may be a better way to do it - let's see what it is. – Mike Dunlavey Mar 01 '15 at 23:17
1

Millions of programmers can be infected with a hurtful idea, namely that finding speedups needs lots of samples, summarized. The idea has no grounding, theoretical or practical. The way it hurts them is the summarizing masks actual speedups, which are simply not found. Some people are not so crowd-influenced, like [*Jon Bentley, slide 35*](http://dimacs.rutgers.edu/Workshops/EAA/slides/bentley.pdf), [*Agner Fog, page 18*](http://www.agner.org/optimize/optimizing_cpp.pdf), and those who voted on [*this*](http://stackoverflow.com/a/378024/23771). – Mike Dunlavey Mar 02 '15 at 21:52
Dunlavey, I think that what is called "Off-CPU" here can be useful, and it will bring perf closer to the your "wall time" profiling instead of traditional `gprof`/`perf record` "cpu time" profiling. PS: do you have online copy of your sigplan 2007 http://dl.acm.org/citation.cfm?id=1294298 PPS: what about expaning the https://en.wikipedia.org/wiki/Deep_sampling article to the introduction of the method, with adding some of external sources (not from your papers)? – osgx Mar 03 '15 at 04:22
PPPS: 2000x speedup example (I think you mean this http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-win/) was based not on new buzzwords (on/off cpu), but on old Sun's marketing - Dtrace; but actually problem was identified by taking stack samples. The only difference is that he used 2000 of samples because he can, and he had 3 different stacks each of them pointing into mbtowc or multibyte code. Yes, one or two samples was enough for the problem, but the example was about dtrace marketing (http://www.slideshare.net/brendangregg/from-dtrace-to-linux - slide 14 "Dtrace had awesome") – osgx Mar 03 '15 at 05:27
Actually the 2000x came from his video, where he claimed he got it as a result of a flame graph, implying it might not have been found without the flame graph. I'm just saying a single poke or two would also have found it. What's more, the flame graph would not have aggregated it if it were numerous short calls sprinkled around the call tree, Really big apps don't have simple call trees where time-hogs just sit there always being reached by the same path. – Mike Dunlavey Mar 03 '15 at 14:17
If you send me an email address, I will send you my latest usenix submission. It contains references. (It may not get accepted for the conference - it's amazing how pervasive the misconception is.) – Mike Dunlavey Mar 03 '15 at 15:17
`perf script` has no input from `perf.data`, how to fix it? – skytree Mar 23 '19 at 19:59
@skytree, the correct way of fixing it would be not adding comments to 4 year old post, but opening new question and giving full details: Your os version, your kernel version (`uname -a`), exact commands you started and their output, what was expected and why, was the command working for you earlier, with other kernel, other OS, or for anybody else. – osgx Mar 24 '19 at 07:52

How to use linux `perf` tool to generate "Off-CPU" profile

2 Answers2

Linked