
I only have a rough idea about this, so I would like some more practical ideas. Ideas for Linux, Unix, and Windows are all welcome.

The rough idea in my head is:

The profiler sets up some kind of timer and a timer interrupt handler in the target process. When the handler takes control, it reads and saves the value of the instruction pointer register. When the sampling is done, it counts the occurrences of each sampled IP value, so we can see the 'top hitters' among all sampled program addresses.

But I do not actually know how to do it. Can someone give me some basic but practical ideas? For example, what kind of timer (or equivalent) is typically used? How do you read the IP register value? (I think that when execution enters the profiler's handler routine, the IP is pointing at the entry of the handler, not somewhere in the target program, so we cannot simply read the current IP value.)
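To make my rough idea concrete, here is the kind of thing I imagine, just as a sketch. I am assuming POSIX `setitimer(2)` plus a `SIGPROF` handler installed with `SA_SIGINFO`, and guessing that the `ucontext_t` argument carries the interrupted program's saved registers; the `REG_RIP` name is x86-64 Linux/glibc specific, and I do not know whether this is the right approach:

```c
#define _GNU_SOURCE            /* for REG_RIP in <ucontext.h> on glibc */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 100000
static unsigned long samples[MAX_SAMPLES];
static volatile int nsamples = 0;

/* SIGPROF handler: record the instruction pointer of the interrupted code.
 * The third argument is a ucontext_t holding the saved registers at the
 * point of interruption, not the handler's own IP. */
static void on_sigprof(int sig, siginfo_t *si, void *ucv)
{
    (void)sig; (void)si;
    ucontext_t *uc = (ucontext_t *)ucv;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = (unsigned long)uc->uc_mcontext.gregs[REG_RIP];
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Deliver SIGPROF every 10 ms of CPU time used by this process. */
    struct itimerval tv = { {0, 10000}, {0, 10000} };
    setitimer(ITIMER_PROF, &tv, NULL);

    for (volatile long i = 0; i < 500000000L; i++) ;   /* workload to sample */

    printf("collected %d samples\n", nsamples);
    if (nsamples > 0)
        printf("first sampled IP = 0x%lx\n", samples[0]);
    return 0;
}
```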

Thank you for your answer!


Thanks for the answers from Peter Cordes and Mike Dunlavey.

Peter's answer explains how to read the registers and memory of another process. Now I realize that the profiler does not have to execute 'inside' the target process; instead, it can read the target's registers/memory from the outside using ptrace(2). It does not even have to suspend the target explicitly, since ptrace does that anyway.
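For example, something along these lines is what I now have in mind; a minimal sketch assuming x86-64 Linux and the `struct user_regs_struct` layout from `<sys/user.h>`, with error handling omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Attach to a running process, read its instruction pointer once, then
 * detach and let it continue. A sampling profiler would repeat this
 * many times per second and aggregate the results. */
int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);   /* stops the target */
    waitpid(pid, NULL, 0);                    /* wait until it is stopped */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);
    printf("pid %d is at IP = 0x%llx\n", (int)pid, (unsigned long long)regs.rip);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);   /* resumes the target */
    return 0;
}
```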

Mike's answer suggests that, for performance profiling, counting the occurrences of each stack trace makes more sense than counting IP register values, since the latter gives too much noisy information when execution happens to be inside a system module at the moment of sampling.

Thank you guys so much!

Zhou
  • Related: [How does gdb read the register values of a program / process it's debugging? How are registers associated with a process?](https://stackoverflow.com/questions/48785758/how-does-gdb-read-the-register-values-of-a-program-process-its-debugging-how/48798085#48798085). So you could automate Mike Dunlavey's favourite profiling technique, of [stopping with a debugger](https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-in-linux/378024) a few times, which is basically what you're imagining. – Peter Cordes Mar 09 '18 at 06:18
  • 2
  • Some profilers build extra instrumentation into the process itself (like `gcc -pg`), to log the time and call-stack on entry to every function. But performance-counter events are different. Linux has a `perf` API separate from `ptrace` for profiling other processes. See http://www.brendangregg.com/perf.html for some examples. IDK exactly what VTune on Windows uses; I think it has its own kernel module / driver to give it access to the HW performance counters and collect the interrupts generated by HW performance counters wrapping around (creating a sample). – Peter Cordes Mar 09 '18 at 06:21
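For reference, here is a minimal sketch of the `perf_event_open` API mentioned in the comment above, following the pattern documented in the perf_event_open(2) man page. It only counts retired instructions for the calling process rather than sampling another one, just to show the shape of the API:

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* There is no glibc wrapper for perf_event_open, so call it via syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: count for the calling process on any CPU. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile long i = 0; i < 1000000; i++) ;   /* the measured work */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```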

1 Answer


Good for you for wanting to do this. Advice - don't try to mimic gprof.

What you need to do is sample the call stack, not just the IP, at random or pseudo-random times.

  • First reason - I/O and system calls can be deeply buried in the app and be costing a large fraction of the time, during which the IP is meaningless but the stack is meaningful. ("CPU profilers" simply shut their eyes.)

  • Second reason - Looking at the IP is like trying to understand a horse by looking at the hairs on its tail. To analyze performance of a program you need to know why the time is spent, not just that it is. The stack tells why.

Another problem with gprof is it made people think you need lots of samples - the more the better - for statistical precision. But that assumes you're looking for needles in a haystack, the removal of which saves next to nothing - in other words you assume (attaboy/girl programmer) there's nothing big in there, like a cow under the hay. Well, I've never seen software that didn't have cows in the hay, and it doesn't take a lot of samples to find them.

How to get samples: having a timer interrupt and reading the stack (in binary) is just a technical problem. I figured out how to do it a long time ago. So can you. Every debugger does it. But to turn it into code names and locations requires a map file or something like it, which usually means a debug build (not optimized). You can get a map file from optimized code, but the optimizer has scrambled the code so it's hard to make sense of.
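For a concrete illustration of taking such samples in-process on Linux, here is one crude sketch (not necessarily the technique described above): it uses glibc's `backtrace(3)` from a `SIGPROF` handler, which is not strictly async-signal-safe, and it only prints raw return addresses plus whatever `backtrace_symbols_fd` can recover without debug info, so treat it as a demonstration of the idea rather than a working profiler:

```c
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_DEPTH   32
#define MAX_SAMPLES 4096

static void *stacks[MAX_SAMPLES][MAX_DEPTH];
static int   depths[MAX_SAMPLES];
static volatile int nsamples = 0;

/* SIGPROF handler: capture the whole call stack, not just the IP. */
static void on_sigprof(int sig)
{
    (void)sig;
    if (nsamples < MAX_SAMPLES) {
        depths[nsamples] = backtrace(stacks[nsamples], MAX_DEPTH);
        nsamples++;
    }
}

static void workload(void)
{
    for (volatile long i = 0; i < 300000000L; i++) ;
}

int main(void)
{
    signal(SIGPROF, on_sigprof);

    struct itimerval tv = { {0, 10000}, {0, 10000} };  /* ~100 samples/sec */
    setitimer(ITIMER_PROF, &tv, NULL);

    workload();

    /* Count how often each distinct stack occurs (O(n^2), fine for a demo). */
    for (int i = 0; i < nsamples; i++) {
        if (depths[i] == 0) continue;                  /* already counted */
        int count = 1;
        for (int j = i + 1; j < nsamples; j++) {
            if (depths[j] == depths[i] &&
                memcmp(stacks[j], stacks[i], depths[i] * sizeof(void *)) == 0) {
                count++;
                depths[j] = 0;                         /* mark as counted */
            }
        }
        printf("%3d samples with this stack:\n", count);
        backtrace_symbols_fd(stacks[i], depths[i], 1); /* fd 1 = stdout */
    }
    return 0;
}
```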

Is it worthwhile taking samples in non-optimized code? I think so, because there are two kinds of speedups, the ones the compiler can do, and the ones you can do but the compiler can't. The latter are the cows. So what I and many other programmers do first is performance tuning on un-optimized code using random sampling. When all the cows are out, turn on the optimizer and let the compiler do its magic.

Mike Dunlavey
  • Thank you so much for your informative answer. I believe I will have to read it a few more times, and try to find a way to catch the timer interrupt. I totally agree that counting the occurrences of a particular stack trace makes more sense and gives less noise. But by examining the IP together with the stack trace, we could find out exactly which statement in the user's source code costs most of the CPU time. I guess that with debug information we could tell statements or function calls in the user's source code from those in system modules; that's just a guess though. Thank you again. – Zhou Mar 09 '18 at 16:45
  • @dʒəu: The IP is included in the stack trace. And you say "we could find out exact which statement in the user's source code costs most of the CPU time". If the *statement in the user's source code that costs the most time* is a function call, like string compare, *new*, moving a UI control, *print*, or anything else, *the IP will not be there*, but that line of code IS on the stack sample where you can see it. If that line costs 90% of the time, it is on the stack 90% of the time, where you can see it. *That's the whole point.* – Mike Dunlavey Mar 09 '18 at 19:55
  • 2
  • I have to disagree about profiling un-optimized code. That just seems pointless when you can build with `gcc -march=native -O3 -g -fno-omit-frame-pointer` to get optimized code with debug symbols, and frame pointers so you can use simple frame-pointer based stack unwinding instead of the complicated `.eh_frame` stuff. If you profile un-optimized builds, especially of C++, you're going to waste your time optimizing for `-O0` once you find a hotspot and start working on it. [That is pointless](https://stackoverflow.com/questions/49189685/). – Peter Cordes Mar 10 '18 at 02:53
  • @PeterCordes: That's good information about getting debug info in an optimized build. Thanks. You focussed on an authentic "hot spot", which happens sometimes. The (very common) kind I'm talking about is where halfway up the stack (where optimization achieves nothing) 1 out of 17 print statements, or `new` calls, or system calls to get internationalized strings, etc., is on 9/10 of stack samples, so it's costing ~90% of time, but you can't tell which one it is because the optimizer took a mix-master to the code. No meaningful line numbers. Cheers. – Mike Dunlavey Mar 10 '18 at 15:59
  • Flame graphs are good for visualization of stack traces when you for any reason have more data than you can hold in your head. – Thorbjørn Ravn Andersen Feb 02 '23 at 14:59
  • @ThorbjørnRavnAndersen: Hope you're OK. It's been a long time. I have a gripe with flame graphs - they are more pretty than effective. Not only are they missing line-of-code info, but it is entirely too easy for big problems to hide in them. [*More.*](https://stackoverflow.com/a/27867426/23771) Also [*here.*](https://stackoverflow.com/a/25870103/23771) – Mike Dunlavey Feb 02 '23 at 20:39
  • @MikeDunlavey I had a look at the links, saw that the defaults for flame graphs did not show the hot paths, and expected icicle graphs to show them instead. They didn't. That was interesting, as tabular data does not scale well in general (many samples, many threads), so what you are pointing out, in my opinion, is that a better visualization is needed to see these things easily and intuitively. – Thorbjørn Ravn Andersen Feb 12 '23 at 22:18
  • @ThorbjørnRavnAndersen: I had two ways of dealing with multi-thread issues, both effective but not easy. One was when I took a sample, to sample all threads, to see which threads were waiting for others and why. The other was to construct a timeline on paper of asynchronous events, such as what event caused a database update to be waited for, and figuring out which opportunities for parallelism were not being exploited. It would take all morning to do this, but it was worth it because it worked. If there were a tool to help, that would be nice, but I'm not holding my breath. – Mike Dunlavey Feb 13 '23 at 19:41