8

I'm profiling a C++ application using linux perf, and I'm getting a nice control flow graph using GProf2dot. However, some symbols from the C library (libc6-2.13.so) take a substantial portion of the total time, and yet have no in-edges.

For example:

  • _int_malloc takes 8% of the time but has no call parents.
  • __strcmp_sse42 and __cxxabiv1::__si_class_type_info::__do_dyncast together take about 10% of the time, and have a caller whose name is 0, which has callers 2d6935c, 2cc748c, and 6, which have no callers.

As a result, I can't find out which routines are responsible for all this mallocing and dynamic casting using just perf. However, it seems that other symbols (e.g. malloc but not _int_malloc) do have call parents.

Why doesn't perf show call parents for _int_malloc? Why can't I find the ultimate callers of __do_dyn_cast? And, is there some way for me to modify my setup so that I can get this information? I'm on x86-64, so I'm wondering if I need a (non-standard) libc6 with frame pointers.

osgx
  • 90,338
  • 53
  • 357
  • 513
BenRI
  • 724
  • 6
  • 17

2 Answers2

5

Update: As of the 3.7.0 kernel, one can determine call parents of symbols in system libraries using perf record -gdwarf <command>.

Using -gdwarf, there is no need to compile with -fno-omit-frame-pointer.

Original answer: Yes, one probably would need a libc6 compiled with frame pointers (-fno-omit-framepointer) on x86_64, at the moment (May 24, 2012).

However, developers are currently working on allowing the perf tools to use DWARF unwind info. This means that frame pointers are no longer needed to get backtrace information on x86_64. Linus, however, does not want a DWARF unwinder in the kernel. Thus, the perf tools will save registers as the system is running, and perform the DWARF unwinding in the userspace perf tool using the libunwind library.

This technique has been tested to successfully determine callers of (for example) malloc and dynamic_cast. However, the patch set is not yet integrated into the Linux kernel, and needs to undergo further revision before it is ready.

BenRI
  • 724
  • 6
  • 17
1

_int_malloc and __do_dyn_cast are being called from routines that the profiler can't identify because it doesn't have symbol table information for them.

What's more, it looks like you are showing self (exclusive) time. That is only useful for finding hotspots in routines that a) have much self time, and b) you can fix.

There's a reason profilers subsequent to the original unix profil were created. Real software consists of functions that spend nearly all their time calling other functions, and you need to be able to find code that is on the stack much of the time, not that has the program counter much of the time.

So you need to configure perf to take stack samples and tell you the percent of time each of your routines is on the stack. It is even better if it reports not just routines, but lines of code, as in Zoom. It is best to take the samples on wall-clock time, so you're not blind to IO.

There's more to say on all this.

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • Thanks Mike. I've got both self time and caller time (I used perf record -g to get the call stacks). I can't do much about the time taken executing dynamic casts (does g++ really compare typeinfo with strcmp??) but I plan to make sure less dynamic casting (and memory allocation) is done. To do this it would be nice to know what's calling the dynamic cast and malloc functions. – BenRI Apr 20 '12 at 21:48
  • Thanks Mike. I've got both self time and caller time (I used perf record -g). Clearly I can't make malloc or dynamic_cast much faster, so I'm trying to find which routines are calling them and fix the worst offenders first. I think the problem is that the kernel may have trouble unwinding stack frames in some cases (note, not all cases) and I'm curious why the kernel is failing to unwind the stack and find the callers of (say) __strcmp_sse_42. I **suspect** that this is being used to compare typeinfo objects, but without knowing the callers, its hard to be sure. – BenRI Apr 20 '12 at 21:56
  • 1
    @BenRI: That link discusses how I do it, and *[this link](http://sourceforge.net/projects/randompausedemo/files/)* has a short slide show of getting an overall speedup ratio of about 700x by finding and fixing a series of problems of that sort. – Mike Dunlavey Apr 21 '12 at 01:28