I'm currently calling oprofile with these parameters:
operf --callgraph --vmlinux /usr/lib/debug/boot/vmlinux-$(uname -r) <BINARY>
opreport -a -l <BINARY>
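Since operf was run with --callgraph above, opreport can also display caller/callee relationships instead of a flat profile; a minimal sketch (the <BINARY> placeholder is the same one used in the commands above):

```shell
# Show the call graph recorded by operf --callgraph:
# for each symbol, callers are listed above it and callees below it,
# so time spent in memset can be traced back to its callers.
opreport --callgraph <BINARY>
```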
As an example, the output is:
CPU: Core 2, speed 2e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 90000
samples cum. samples % cum. % image name symbol name
12635 12635 27.7674 27.7674 libc-2.15.so __memset_sse2
9404 22039 20.6668 48.4342 vmlinux-3.5.0-21-generic get_page_from_freelist
4381 26420 9.6279 58.0621 vmlinux-3.5.0-21-generic native_flush_tlb_single
3684 30104 8.0962 66.1583 vmlinux-3.5.0-21-generic page_fault
701 30805 1.5406 67.6988 vmlinux-3.5.0-21-generic handle_pte_fault
You can see that most of the time is spent within __memset_sse2, but it is not obvious which part of my own code should be optimized, at least not from the output above.
In my specific case, I was able to locate the source of the problem quickly by using a kind of poor man's profiler: I ran the program in a debugger, interrupted it from time to time, and looked at the call stacks of each thread.
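That manual approach can also be scripted; a rough sketch using gdb in batch mode (PID is an assumption standing in for the process ID of the running program, and the sample/sleep counts are arbitrary):

```shell
# "Poor man's profiler": snapshot the call stacks of all threads a few
# times; the most frequently recurring frames indicate the hot call paths.
for i in 1 2 3 4 5; do
    gdb -batch -ex 'thread apply all bt' -p "$PID" 2>/dev/null
    sleep 1
done | grep '^#' | sort | uniq -c | sort -rn | head
```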
Is it possible to get the same information directly from the output of oprofile? The strategy that I used will most likely fail if the performance bottleneck is not as obvious as it was in my example.
Is there an option to ignore all calls to external functions (e.g., to the kernel or libc) and just accumulate the time to the caller? For example:
void foo() {
// some expensive call to memset...
}
Here, it would be more insightful for me to see foo at the top of the profiling output, not memset.
(I tried opreport --exclude-dependent but found it unhelpful, as it seems only to omit the external functions from the output.)