
I'm collecting program counter samples from an ARM Cortex-M3, a long list like this:

  1. 0x8005b2a
  2. 0x8001324
  3. 0x8005b34
  4. 0x8001318

The PC is sampled periodically. I now want to produce a static flat profile of the running program, like (g)prof does with support from the Linux kernel.

Is there a way to convert these PC samples into a (g)prof-readable format, or are there other tools that give me a profile based on these PC samples and an *.elf / *.lst file?

artless noise
Hans Müller
  • Are you doing this because you want to find ways to get more speed? If so, you will have much better luck with a [*small number of stack samples*](http://stackoverflow.com/a/378024/23771) than a large number of program counter samples. Check [*second answer here*](http://archive.is/9r927). – Mike Dunlavey Nov 06 '15 at 13:25
  • Yes, I'm going to analyze how many cycles are spent in each function. I'm sampling all calling data (meaning: the PC value of the caller function + the PC value of the called function), plus the PC value every 512 cycles. – Hans Müller Nov 08 '15 at 14:07
  • That means you want to periodically read not only the pc, but the entire call stack. Now, if you want to do more than just analyze, but you want to actually find bottlenecks so you can remove them, (you see - that's a different problem) here's how to do it. Any bottleneck takes a certain fraction of time, right? Suppose it is 20%. That means any random-time stack sample will show it with 20% probability. So if you start taking stack samples at random times, you should have seen the problem twice after 10 samples, on average. So the bigger it is, the fewer samples you need. Clear? – Mike Dunlavey Nov 09 '15 at 13:07
  • I think this is clear to me. I'm sampling all calling data by overriding the __gnu_mcount_nc function and passing the -pg option to gcc. In addition, I'm also sampling the PC every 512 clock cycles. This means, to me, that I will see every bottleneck which is longer than 512 cycles, and the smaller ones with a probability of X%; X should increase with the number of samples. With the stack sampling I get calling data to detect small, often-called functions. Am I right? – Hans Müller Nov 10 '15 at 21:33
  • If I understand, you're getting two things: 1) PC samples every 512 cycles, and 2) some information on entering any function. What I suggest you do is run the program under a debugger or emulator so that you can manually interrupt it. When you do, get a stack trace and study it so you understand everything happening at that point in time. Be hopeful that whatever it's doing might be eliminated. Do this several times. Anything that could be eliminated, if you see it on >1 sample, is a nice big bottleneck. It will also find flocks of little birds just fine, don't worry. – Mike Dunlavey Nov 10 '15 at 21:50
  • Just yesterday, our project showed a painful slowness on a certain kind of input. There were all kinds of guesses of what could cause it. Today I took a bunch of stackshots. I examined each one to see what it was doing. Result - getting rid of needless processing got minutes down to seconds. No profiler-like processing would have told what the problems were. (Sorry to flame - everyone should know how to do this, and do it.) – Mike Dunlavey Jan 10 '16 at 02:17
  • @HansMüller Hi Hans, I am curious to know how you gathered your program counter data as I am doing a project where we are trying to make the program counter of a processor more robust. How did you get the hex values of the PC from running? – David777 Sep 08 '22 at 14:08

1 Answer


Hans Müller, as I understand it, the gprof format is not so easy to generate (https://sourceware.org/binutils/docs/gprof/File-Format.html - "new file format is defined in header file gmon_out.h. It consists of a header containing the magic cookie and a version number, ... Histogram records consist of a header that is followed by an array of bins...").

I can recommend generating one of two other formats instead: the gperftools CPU-profile format (read by pprof) or the callgrind format (callgrind.out, read by kcachegrind).

An easier way (for a flat profile) is to use some awk/perl/python scripting and the addr2line tool from binutils (you need an addr2line built with support for your target architecture). This tool will give you function names for addresses (if you correctly map the PC samples to virtual addresses of the ELF binary), and your script should sum the samples for every function and then sort. It is harder to handle a call graph in small scripts.
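
As a rough illustration of that approach, here is a minimal Python sketch. It assumes the samples are stored one hex PC per line in a file called pc_samples.txt, that the image is firmware.elf, and that the cross addr2line is available as arm-none-eabi-addr2line; all of those names are placeholders for your setup.

```python
#!/usr/bin/env python3
# Minimal flat-profile sketch: map PC samples to function names with
# addr2line, then count samples per function.
import subprocess
from collections import Counter

SAMPLES = "pc_samples.txt"             # one hex PC per line (assumption)
ELF = "firmware.elf"                   # image the PCs were sampled from
ADDR2LINE = "arm-none-eabi-addr2line"  # addr2line built for the target

with open(SAMPLES) as f:
    pcs = [line.strip() for line in f if line.strip()]

# Resolve each unique address once; with -f, addr2line prints the function
# name on one line and file:line on the next, in the order given.
uniq = sorted(set(pcs))
out = subprocess.run([ADDR2LINE, "-f", "-e", ELF] + uniq,
                     capture_output=True, text=True, check=True).stdout
func_of = dict(zip(uniq, out.splitlines()[0::2]))

counts = Counter(func_of[pc] for pc in pcs)
total = sum(counts.values())
for func, n in counts.most_common():
    print(f"{100.0 * n / total:6.2f}%  {n:8d}  {func}")
```

For very long sample lists you may hit the command-line length limit; in that case batch the addresses or feed them to addr2line on its standard input instead.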

gperftools' pprof is just a script which can run addr2line for you (you still need the right variant of addr2line). It is capable of summing samples per function and sorting, and it can even call objdump to get annotated disassembly.
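(If you go that route, the usual invocation would be something along the lines of `pprof --text ./firmware.elf cpu.prof` to get a flat text listing; the file names here are only placeholders.)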

Both formats may be used for flat profiles, and they support call graphs to some degree: pprof allows you to save a full backtrace for every event, while the callgrind.out format stores only caller-callee pairs (cfn), so kcachegrind may incorrectly guess the paths to hot code.
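
To make the callgrind side concrete, here is a minimal sketch of writing a flat callgrind.out file (viewable in kcachegrind) from per-function sample counts like those computed in the script above. The function names and counts below are made-up example data, and since plain PC samples carry no source information, every cost is attributed to line 0 of an unknown file.

```python
#!/usr/bin/env python3
# Sketch: emit a flat profile in callgrind format from per-function
# sample counts. Example data only; plug in the Counter from the
# addr2line script instead.
counts = {"uart_poll": 412, "crc32_update": 187, "main": 23}

with open("callgrind.out.samples", "w") as out:
    out.write("# callgrind format\n")
    out.write("version: 1\n")
    out.write("creator: pc-sample-converter (sketch)\n")
    out.write("events: Samples\n\n")
    for func, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        out.write("fl=??\n")                 # source file unknown
        out.write(f"fn={func}\n")
        out.write(f"0 {n}\n\n")              # line 0 = unknown position
```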

osgx
  • PS: there is also the https://github.com/jrfonseca/gprof2dot tool, capable of converting many input formats into call graphs. Some of the input formats may be easier to generate. – osgx Jan 09 '16 at 04:45
  • Thanks for your answer. I already wrote a Java program to convert my format to the gprof format. The problem there is that there are many different versions of gprof with, as you mentioned, the new and the old file format. Your link to gperftools looks very interesting. I think I will also implement this format. – Hans Müller Jan 10 '16 at 08:21