
I've been searching for a Linux sampling profiler, and callgrind has come the closest to showing useful results. However, its overhead is estimated at 20 to 100 times slower than normal. Additionally, I'm only interested in time spent per function (with particular emphasis on blocking calls such as read() and write(), which no other profiler will faithfully display).

  1. Is there a way to turn off excess options, so that just the minimum data is recorded for generating times spent in various call stacks?
  2. Does callgrind's cachegrind heritage imply that extra work is being done with regard to cache profiling, etc.?
  3. I assume callgrind operates like a debugger. Can this be adjusted to sample the process at intervals, rather than every single instruction?
osgx
Matt Joiner

3 Answers


3) Callgrind works like a dynamic binary translator: it instruments the original code with extra counting code. Instrumentation is added for each memory-access instruction (for cache simulation) and, I suppose, for each jump-like instruction, to track the execution count of every basic block.
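Callgrind's per-instruction counting can't be reproduced in a few lines. As a rough analogy only (not how Valgrind works internally), Python's `sys.settrace` hook gives an instrumenting profiler whose callback fires on every executed line. This shows why counting every event costs far more than occasional sampling. `count_line_events` and `summation` are hypothetical names used for illustration.

```python
import sys
from collections import Counter


def count_line_events(func, *args):
    """Run func(*args) under a tracer that is invoked on every executed line.

    Analogy only: like an instrumenting tool, the tracer pays a cost on every
    event, so overhead grows with the amount of code executed.
    """
    counts = Counter()

    def tracer(frame, event, arg):
        if event == "line":
            # One counter per (function, line), loosely like a per-block count.
            counts[(frame.f_code.co_name, frame.f_lineno)] += 1
        return tracer  # keep receiving events for this frame

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return counts


def summation(n):
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    counts = count_line_events(summation, 1000)
    for (name, line), hits in counts.most_common(3):
        print(f"{name}:{line} executed {hits} times")
```

The loop body fires a callback on each of its 1000 iterations, whether or not anyone cares about that line.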

I have a small sampling profiler that acts just like a debugger: it injects a setitimer() profiling timer into the application, then intercepts every SIGALRM and prints the current $eip value.
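A minimal sketch of the same setitimer/SIGALRM idea follows, using Python's `signal` wrappers. It records the current Python function name rather than a raw `$eip`, so it only illustrates the technique. `ITIMER_REAL` fires on wall-clock time, so time spent blocked (here `time.sleep`, standing in for a blocking `read()`) is still sampled. It assumes Unix and that it is called from the main thread. `sample_wall_clock` and `wait_for_io` are hypothetical names.

```python
import collections
import signal
import time


def sample_wall_clock(func, *args, interval=0.001):
    """Count which Python function is executing on each SIGALRM tick."""
    samples = collections.Counter()

    def on_alarm(signum, frame):
        # `frame` is the Python frame that was running when the signal arrived.
        samples[frame.f_code.co_name] += 1

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # ITIMER_REAL counts wall-clock time and delivers SIGALRM when it expires.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    try:
        func(*args)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
    return samples


def wait_for_io():
    time.sleep(0.2)  # stand-in for a blocking read()/write()


if __name__ == "__main__":
    print(sample_wall_clock(wait_for_io).most_common(3))
```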

Earlier sampling profilers also used the setitimer approach, and there is a profil() function for the same purpose. It is used by glibc's gmon/gmon.c and therefore by gprof -p (to be exact, by programs compiled with gcc -pg). profil() can profile a single contiguous code region, sampling on virtual CPU time every 1 or 10 milliseconds. There is also an sprofil() function.
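The "single contiguous region" limit comes from how profil() turns a sampled PC into a histogram slot. The profil(3) man page describes this as: subtract `offset`, multiply by `scale`, and divide by 65536. The sample is counted only if the result fits in the buffer. The sketch below reproduces that arithmetic under the assumption that the man page formula is the whole story; glibc's real implementation may differ in detail. `profil_bucket`, `histogram` and `nbuckets` (standing in for the man page's `bufsiz`) are hypothetical.

```python
def profil_bucket(pc, offset, scale, nbuckets):
    """Map a sampled program counter to a histogram slot, profil()-style.

    Addresses below `offset` or past the end of the buffer are dropped,
    which is why one buffer covers only one contiguous address range.
    """
    if pc < offset:
        return None
    slot = (pc - offset) * scale // 65536
    return slot if slot < nbuckets else None


def histogram(pcs, offset, scale, nbuckets):
    """Accumulate per-slot sample counts, as profil() does on each tick."""
    buf = [0] * nbuckets
    for pc in pcs:
        slot = profil_bucket(pc, offset, scale, nbuckets)
        if slot is not None:
            buf[slot] += 1
    return buf


if __name__ == "__main__":
    # scale=65536 gives one slot per address; 16384 lumps 4 addresses per slot.
    print(histogram([0x1000, 0x1001, 0x1002, 0x1003, 0x5000], 0x1000, 16384, 4))
```

With a scale of 16384, the four nearby addresses share slot 0, while 0x5000 falls outside the region and is silently lost.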

Also check `LD_PRELOAD=/lib/libpcprofile.so PCPROFILE_OUTPUT=output.file`, but I don't know whether or how it works.

For the other numbered questions:

2) "Callgrind is an extension to Cachegrind. It provides all the information that Cachegrind does, plus extra information about callgraphs." So it can provide anything Cachegrind does, but it also lets the user turn off cache simulation with --simulate-cache=no (which is the default).

For speed: according to the manual for the Nul Valgrind tool (aka Nulgrind) at http://www.valgrind.org/docs/manual/nl-manual.html, which does no instrumentation at all, the slowdown is about 5 times. This is because the program is dynamically translated by Valgrind itself. So no Valgrind tool can run faster than Nulgrind.
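Combining the question's 20 to 100x estimate with Nulgrind's roughly 5x floor gives a rough split of where callgrind's time goes. This assumes both figures were measured on comparable workloads, which the sources do not state:

$$
\frac{T_{\text{callgrind}}}{T_{\text{native}}}
= \underbrace{\frac{T_{\text{nulgrind}}}{T_{\text{native}}}}_{\approx 5}
\times \frac{T_{\text{callgrind}}}{T_{\text{nulgrind}}}
\quad\Longrightarrow\quad
\frac{T_{\text{callgrind}}}{T_{\text{nulgrind}}} \approx \frac{20}{5} \text{ to } \frac{100}{5} = 4 \text{ to } 20 .
$$

Disabling options can at best remove that second factor. The 5x translation cost remains.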

osgx
  • Here is how to activate gmon: http://stackoverflow.com/questions/2279754/profiling-a-c-or-c-based-application-that-never-exits/2280519#2280519 – osgx May 11 '11 at 15:41

Try using Zoom from RotateRight. It has a "Thread Time" configuration that samples all threads in a single process whether they are running or blocked.
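The "sample every thread whether running or blocked" idea can be sketched in a few lines. This is not how Zoom works; it is an illustration using CPython's `sys._current_frames()`, a documented but underscore-prefixed function intended for specialised use. A background sampler snapshots each thread's current Python function at a fixed wall-clock interval, so a thread parked in a blocking call keeps accumulating samples. `sample_all_threads` and `blocked_worker` are hypothetical names.

```python
import collections
import sys
import threading
import time


def sample_all_threads(duration, interval=0.005):
    """Snapshot the current Python function of every other thread."""
    me = threading.get_ident()
    samples = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for tid, frame in sys._current_frames().items():
            if tid != me:  # do not sample the sampler itself
                samples[frame.f_code.co_name] += 1
        time.sleep(interval)
    return samples


def blocked_worker():
    time.sleep(0.5)  # stand-in for a thread blocked in read()/write()


if __name__ == "__main__":
    worker = threading.Thread(target=blocked_worker)
    worker.start()
    print(sample_all_threads(0.2).most_common(3))
    worker.join()
```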

federal
  • Yeah but $399. Certainly this appears to be the best product available for my purposes, but not for hobby programming. – Matt Joiner Sep 06 '12 at 06:44

Have you tried gprof? It doesn't have the large overhead that valgrind does.

Zitrax
  • Unfortunately gprof does not do sampling; it _only_ counts _CPU_ time for instrumented functions. (Feel free to correct me if you know better, but this is what I understand.) – Matt Joiner Sep 10 '10 at 07:30
  • http://sourceware.org/binutils/docs/gprof/Implementation.html#Implementation "Profiling also involves watching your program as it runs, and keeping a histogram of where the program counter happens to be every now and then." So gprof's histogram is actually a PC-sampling profile. – osgx May 11 '11 at 15:41
  • http://sourceware.org/binutils/docs/gprof/Flat-Profile.html#Flat-Profile - Flat profile is based mainly on sampling. `gprof -p`, first two columns. – osgx May 12 '11 at 14:50
  • It uses sampling, but is useless to the OP because it only measures CPU time, not wall clock time, and OP mentions he is interested in particular in the blocking calls like read() and write(). See also http://stackoverflow.com/questions/11762372/profiling-a-possibly-i-o-bound-process-to-reduce-latency – Arnout Engelen Aug 02 '12 at 14:48
  • @osgx: The wording in the flat profile description suggests wall-time-based sampling, but I do not believe this to be true. Am I wrong here? – Matt Joiner Sep 06 '12 at 06:43
  • Matt Joiner, http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html#SEC12 "gprof gives you are based on a sampling process...The sampling period that is printed at the beginning of the flat profile says how often samples are taken...sampling period is 0.01". This is exactly 1/HZ on older Linux kernels (HZ=100), and the shortest period `setitimer` can use there. Just strace a -pg-instrumented program and look for `setitimer(2)` or `profil(2)` calls (or read the monstartup/moncontrol sources in glibc's gmon/gmon.c). On Linux/POSIX, profil is emulated with setitimer (glibc sysdeps/posix/profil.c). – osgx Sep 06 '12 at 13:14
  • Matt Joiner, the paper docs.freebsd.org/44doc/psd/18.gprof/paper.pdf which was linked to in http://en.wikipedia.org/wiki/Gprof says in "3.2. Execution Times": "A second method samples the value of the program counter at some interval...all that is needed is the ability to set and respond to ‘‘alarm clock’’ interrupts that run relative to program time." – osgx Sep 06 '12 at 13:17
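Whether a CPU-time or a wall-clock timer drives the sampling decides whether blocking calls show up at all. The sketch below counts timer signals while a function sleeps:

- `ITIMER_PROF` advances only while the process uses CPU, and delivers SIGPROF.
- `ITIMER_REAL` advances with wall-clock time, and delivers SIGALRM.

A sleeping process gets almost no CPU-time ticks, which matches the CPU-only behaviour described in the comments above. It assumes Unix and the main thread. `count_ticks` and `blocked_call` are hypothetical names.

```python
import signal
import time


def count_ticks(which, signum, func, interval=0.001):
    """Count how many timer signals arrive while func() runs."""
    ticks = 0

    def on_tick(signum, frame):
        nonlocal ticks
        ticks += 1

    old_handler = signal.signal(signum, on_tick)
    signal.setitimer(which, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(which, 0)  # disarm before restoring the handler
        signal.signal(signum, old_handler)
    return ticks


def blocked_call():
    time.sleep(0.2)  # stand-in for a blocking read()/write()


if __name__ == "__main__":
    print("CPU-time ticks: ", count_ticks(signal.ITIMER_PROF, signal.SIGPROF, blocked_call))
    print("wall-clock ticks:", count_ticks(signal.ITIMER_REAL, signal.SIGALRM, blocked_call))
```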