Implementing an Interrupt driven Sampling Profiler

Question

I am trying to create a sampling profiler that works on linux, I am unsure how to send an interrupt or how to get the program counter (pc) so I can find out the point the program is at when I interrupt it.

I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack for the thread I am in when I raise(SIGUSR1) rather than the thread the program is being run on. I am not really sure if this is even the right way of going about it...

Any advice?

Why not use `perf record` ... that's what it does. http://en.wikipedia.org/wiki/Perf_(Linux) — amdn, May 12 '14 at 21:08
I need to implement it myself, rather than using pre-existing tools. I am looking into creating the tools — mattthehoopy, May 12 '14 at 21:12
Under GDB I generate the signal, then do `thread 1` to get on the proper thread, then do `bt` to get the stack. If you want to read the stack, there might be a programmatic tool to do it. In the old days I did it the hard way, examining the return pointer, looking at the instructions, etc. You still need the equivalent of a link-map file, so you can convert addresses to function names and line numbers. — Mike Dunlavey, May 12 '14 at 21:13
I'm not sure what you mean by "do thread 1". Do you mean there is a way to switch a thread mid function? I wasn't aware such a thing existed... If I could do that then the rest wouldn't be a problem. — mattthehoopy, May 12 '14 at 21:32
Take a look at `gstack` it is a wrapper script that invokes GDB to print the backtrace for each thread http://linux.die.net/man/1/gstack, also this SO question http://stackoverflow.com/questions/10833317/pthread-threads-and-signals — amdn, May 12 '14 at 21:47

Mike Dunlavey · Answer 1 · 2014-05-13T02:00:26.463

If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof). These are its characteristics.

There are two phases. First is the data-gathering phase:

When it takes a sample, it reads the whole call stack, not just the program counter.
It can take samples even when the process is blocked due to I/O, sleep, or anything else.
You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.

Second is the data-presentation phase. What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses. Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.

The key piece of useful information is residency fraction (usually expressed as a percent). If there are a total of m stack samples, and line of code L is present anywhere on n of the samples, then its residency fraction is n/m. This is true even if L appears more that once on a sample, that is still just one sample it appears on. The importance of residency fraction is it directly indicates what fraction of time statement L is responsible for. If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%. This means that if L could be removed, total time would decrease by 30%. It is typically known as inclusive percent.

You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F. So you can determine the residency fraction for functions, as opposed to lines of code. That would give you inclusive percent by function. You could look at function pairs, like on what fraction of samples do you see function F calling function G. That would give you the information that makes up call graphs.

There are all kinds of information you can get out of the stack samples. One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines of functions immediately below it. On each of these, you can show the residency fraction. You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce. That's how you speed up the code.

Whatever you do for output, I think it is very important to allow the user to actually examine a small number of them, randomly selected. They convey far more insight than can be gotten from any method that condenses the information.

As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:

self time. A useless number. Look at some reasonable-size programs and you will see why.
invocation counts. Of no help in finding code with high residency fraction, and you can't get it with samples alone anyway.
high-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. It's still pretty big - big enough to fix. What happens in most profilers is people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.

"examine a small number of them, randomly selected" - is it the best possible model, sold for 1300 and 700 USD? Why ALL tools give me exact % for every function, and what will give your "random selection" (if it ever can't reproduce same results)? How should I use it to get low-level details about program execution like [here](http://stackoverflow.com/a/23355818/196561)? Not all program has single/low number of hot spots and I'm sure that your random profiler is not silver bullet. — osgx, May 13 '14 at 01:47
@osgx: My mistake - I was interrupted. Rather than debate it in words, there is a short video [*here*](https://www.youtube.com/watch?v=xPg3sRpdW1U) that leads you through it. I said "if you must write a profiler", because I'm ready to bet that no possible profiler, except maybe one based on artificial intelligence, can give you as much speedup performance gain as the purely manual method. — Mike Dunlavey, May 13 '14 at 02:19
Mike Dunlavey, sorry, the video is about your own random profiler not about Zoom. I already have my own non-random one... Do you have similar materials (video or pdf) for Zoom usage? PS: about 5:35 from your video. My experience is biased, but I can get not-summarized samples from several profilers, not only from my own. — osgx, May 13 '14 at 02:23
@osgx: There are demos on the Zoom sight. I do not use Zoom - I only mention it because, if one must use or build a profiler, it should be one like Zoom. I think there is no profiler that will outperform (in terms of speedup achieved) the manual method. I put an example [*here*](https://sourceforge.net/projects/randompausedemo/) where the speedup is 730x, and nobody has yet achieved that with a profiler. If you can get, and use, non-summarized stack samples, then that's the way to go. I do that myself with *rprof*. — Mike Dunlavey, May 13 '14 at 02:29

score 2 · Accepted Answer · edited May 23 '17 at 12:17

You can send signal to specific thread using pthread_kill and tid (gettid()) of target thread.

Right way of creating simple profilers is by using setitimer which can send periodic signal (SIGALRM or SIGPROF) for example, every 10 ms; or posix timers (timer_create, timer_settime, or timerfd), without needs of separate thread for sending profiling signals. Check sources of google-perftools (gperftools), they use setitimer or posix timers and collects profile with backtraces.

gprof also uses setitimer for implementing cpu time profiling (9.1 Implementation of Profiling - " Linux 2.0 ..arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").

For example: result of codesearch for setitimer in gperftools's sources: https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs

void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}

You should know that setitimer has problems with fork and clone; it doesn't work with multithreaded applications. There is try to create helper wrapper: http://sam.zoy.org/writings/programming/gprof.html (wrong one) but I don't remember, does it work correctly (setitimer usually send process-wide signal, and not thread-wide). UPD: seems that since linux kernel 2.6.12, setitimer's signal is directed to the process as whole (any thread may get it).

To direct signal from timer_create to specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and SIGEV_THREAD_ID flag. Don't checked how to create periodic posix timer with thread_create (probably with timer_settime and non-zero it_interval).

PS: there is some overview of profiling in wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

Nice thanks, am trying the pthread thing first as its close to what I am already doing, am getting the tid using syscall() as gettid isnt supported by my compiler, but it just hangs on the pthread_kill call... — mattthehoopy, May 13 '14 at 00:45
This post helped me solve the issue, the final step was using pthread functions to get my thread id for the main. Thanks for all the help :) — mattthehoopy, May 17 '14 at 14:23
mattthehoopy, So, if your question is fully answered, you may "accept" the best answer by clicking "v"-shaped button just under answer voting buttons. After accepting the question is considered closed. If you will have additional questions about creating your profiler, please add comment here with link to the question. PS: which pthread functions did you use to get thread id? — osgx, May 17 '14 at 16:39
Cool, sorry, my first thread. As it turns out you pass the pthread_t variable into the pthread_kill function. I got this variable using pthread_self() in the main thread :) — mattthehoopy, May 19 '14 at 00:50
mattthehoopy, it is unsafe to use pthread_t instead of tid. Write your `gettid` macro via `syscall`. IBM [says](http://publib.boulder.ibm.com/infocenter/iseries/v5r3/topic/apis/concep17.htm): "*Do not allow your program to rely on the internal structure or size of the pthread_t in a non-portable fashion, such as comparisons of thread IDs.*" — osgx, May 19 '14 at 06:21
Oh? I've had no issues so far. And syscall's gettid doesnt work for me in pthread_kill() (the only place i am using it) — mattthehoopy, May 22 '14 at 00:40

Implementing an Interrupt driven Sampling Profiler

2 Answers2