
On Linux, a process's (main thread's) last program-counter value is presented in /proc/$PID/stat. This seems like a remarkably simple way to do sampled profiling without having to instrument a program in any way whatsoever.

I'm wondering if this has any caveats when it comes to sampling quality, however. I'm assuming this value is updated whenever the process runs out of its timeslice, which should happen at effectively random points in the program code, and that samples taken at intervals longer than the timeslice length should therefore be distributed in proportion to where the program actually spends its time. But that's just an assumption, and I realize it could be wrong in any number of ways.
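Concretely, the kind of sampler I have in mind is something like this (a rough sketch; per proc(5), kstkeip is field 30 of /proc/$PID/stat, and stripping everything through the ')' that closes the comm field leaves it as field 28, which stays correct even if the process name contains spaces):

 # print $PID's saved instruction pointer once per second ($PID is a placeholder)
 # note: some kernels report 0 here unless the reader may ptrace the target
 while sleep 1; do
     sed 's/^.*) //' "/proc/$PID/stat" | awk '{printf "0x%x\n", $28}'
 done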

Does anyone know?

Dolda2000

2 Answers


Why not try a modern built-in Linux tool like perf (https://perf.wiki.kernel.org/index.php/Main_Page)?

It has a record mode with adjustable sampling frequency (-F 100 for 100 Hz) and supports many events; for example, the software event task-clock works without using hardware performance counters (stop perf with Ctrl-C, or append sleep 10 to the command to sample for 10 seconds):

 perf record -p $PID -e task-clock -o perf.output.file

Perf works for all threads without any instrumentation (no recompilation or code editing) and will not interfere with program execution (only the timer interrupt handling is slightly modified). (There is also some support for stack-trace sampling with the -g option.)

The output can be analyzed offline with perf report (it is this command, not perf record, that parses the binary and shared libraries):

 perf report -i perf.output.file

or converted to raw PC (EIP) samples with perf script -i perf.output.file.
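For example, one quick way to see the hottest addresses is to histogram those raw samples (a sketch; recent perf versions spell the fields option -F, older ones -f):

 # count how often each instruction pointer was sampled, most frequent first
 perf script -i perf.output.file -F ip | sort | uniq -c | sort -rn | head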

PS: The EIP pointer in the /proc/$pid/stat file is mentioned in the official Linux man page proc(5) (http://man7.org/linux/man-pages/man5/proc.5.html) as kstkeip - "The current EIP (instruction pointer)." It is read in fs/proc/array.c:do_task_stat as eip = KSTK_EIP(task);, but I'm not sure where and when it is filled in. It may be written on a task switch (both involuntary, when the timeslice ends, and voluntary, when the task does something like sched_yield) or on blocking syscalls, so it is probably not the best choice as a sampling source.
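One crude way to probe this from user space (a sketch, not a rigorous test) is to run a loop that makes few or no syscalls and watch whether the sampled value keeps moving:

 # spin in shell builtins, so the loop body itself makes (almost) no syscalls
 sh -c 'while :; do :; done' &
 SPIN=$!
 # sample the saved EIP a few times; if it were only written on voluntary
 # kernel entry, the value would look frozen or heavily biased here
 for i in 1 2 3 4 5; do
     sleep 1
     sed 's/^.*) //' "/proc/$SPIN/stat" | awk '{printf "0x%x\n", $28}'
 done
 kill $SPIN

If the value changes on every sample even for a syscall-free loop, that would suggest it is also saved on timer interrupts, not just on blocking syscalls and yields.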

osgx
  • I did in fact discover `perf record` myself and used it with some success (as mentioned in the comments to Mike's answer), and I also built a simple C program that uses `libunwind` to very quickly get a full stacktrace from a running process. That being said, though, I don't quite understand on what basis you conclude that the `/proc/*/stat` information is suboptimal. – Dolda2000 Feb 17 '17 at 23:34
  • I have no exact idea when the "ip" field of `/proc/*/stat` is updated; there can be a high systematic bias toward blocking syscalls. It is also limited in time resolution (around 100 updates per second), whereas `perf` by default may allow events as frequent as 2-4 kHz (good for short-running programs). What kind of access did you use with libunwind, ptrace? – osgx Feb 18 '17 at 00:23
  • I guess there could be such a bias, but from what we know so far, I see nothing to particularly indicate that there is, or am I missing something? And yes, I did use ptrace access with libunwind. – Dolda2000 Feb 18 '17 at 03:36
  • `fs/proc/array.c`:`do_task_stat()` just prints out `eip = KSTK_EIP(task)`, which is read with `KSTK_EIP(task) (task_pt_regs(task)->ip)`. That is the `struct pt_regs` on the task's thread_info stack - the kernel-mode record of the user thread's registers. While a thread is running on a CPU, nobody updates this kernel structure in memory; the CPU just fetches EIP after EIP without writing the real EIP to RAM. I think the field is updated by the entry code (`entry.S`) on syscalls and/or interrupts, or by `switch_to` - http://lxr.free-electrons.com/source/arch/x86/include/asm/switch_to.h?v=4.4#L31 – osgx Feb 18 '17 at 03:50

If it works, which it might, it will have the shortcomings of prof, which gprof was supposed to remedy. Then gprof has its own shortcomings, which have led to numerous more modern profilers. Some of us consider manual stack sampling ("random pausing") to be the most effective method, and it can be accomplished with a tool as simple as pstack or lsstack.

Mike Dunlavey
  • Oh, I very much agree that manual sampling with a debugger is often the nicest, but my problem is that I'm trying to profile a soft-realtime program with LLVM-JITted code, which causes gdb to take ~1 second to attach to it (to read the JIT-generated symbol tables), which breaks the real-time constraints. Therefore, I'm trying to find different ways to sample it. :) – Dolda2000 Feb 14 '17 at 22:20
  • @Dolda2000: Thoughts: 1) Can you run the whole thing under a debugger, so as to pay the symbol-table cost once up front? 2) Can you run it flat-out, not waiting for outside events? If there is any wastage hiding in there, this should expose it. 3) In my experience, it sort of doesn't matter how much time it takes *after* I've interrupted it. I can take all day to study it. What matters is that the interrupt itself happens at a time that, to the program, is unpredictable, so the wastage will be seen, to the degree that it takes time. – Mike Dunlavey Feb 14 '17 at 22:29
  • All else aside, I'd just like to say that I'd be interested in an actual answer to the question, as that'd help me decide when and in what situations the EIP value from `/proc/*/stat` may be useful. But yes, as for 1) and 3), the problem then is that the time for me to interrupt the program, print the stacktrace and resume it also takes about a second (the advantage of calling `gdb` anew is that I can give it `-batch -ex` options; see the sketch after these comments). If there's a way to remedy that, that too would be highly interesting. As for 2), I need to run the program "in production" for the results to be meaningful. – Dolda2000 Feb 14 '17 at 22:34
  • @Dolda2000: Then if you need to run it "in production" I would try to get stack samples as it is running. Remember - If what you're doing is hunting for wastage, rather than just timing things, it's OK to slow it down. You get it going and then bing! grab a sample. Then get it going again and bing! grab another. If something is wasting 40% of time, 4 out of 10 samples will show it, on average. Sure you're interfering with it, but that's a nominal price for finding the wastage. – Mike Dunlavey Feb 14 '17 at 22:52
  • The problem with that is that I'm not just interfering with the program, but also its users, and I don't want to do that. – Dolda2000 Feb 14 '17 at 22:54
  • I would say: expect minor interruptions while we perform maintenance to improve performance. If that's not acceptable, set up a testbed with artificial traffic. – Mike Dunlavey Feb 14 '17 at 22:56
  • I have in fact tried several times in the past to set up artificial situations, but they've always been disappointing and haven't captured the performance problems that occur in reality. – Dolda2000 Feb 14 '17 at 22:58
  • Well, sampling program counters won't do it. If you're boxed-in that much, the next thing that comes to mind is timeline logging. That's a fair amount of effort, but it works. – Mike Dunlavey Feb 15 '17 at 01:00
  • "Won't do it" seems like a bit of an exaggeration. While I'm not *expecting* results from it, it's not like I haven't used similar methods for good results in the past. I'm certainly also considering other options, though, like writing a program that uses `ptrace` directly to extract a raw stack trace which can then be translated into symbols once the program is allowed to run again. Either way, whether it turns out to be useful or not, I'd still be quite interested to know the characteristics of the PC values in `/proc/*/stat`. – Dolda2000 Feb 15 '17 at 01:11
  • @Dolda2000: Try it. You will see that the program counter is typically in some low-level vanilla routine like waiting for nameless I/O, hanging on some mutex you have no idea why, allocating or freeing some memory you have no idea what, looking up some name in some table for heaven knows what reason. Satisfy your curiosity, but that's why stacks are more useful. – Mike Dunlavey Feb 15 '17 at 16:03
  • In actual fact, I have since discovered `perf record`, which also does PC-based sampling, and used it to optimize two specific algorithmic functions, which sped up the program as a whole by a factor of 1.5-2. I'm not disagreeing that stacks are generally more useful, but it goes to show that PC-based sampling *can* be useful as well. – Dolda2000 Feb 15 '17 at 17:16
  • @Dolda2000: Can't argue with success :) It's just that things higher up the stack are no less likely to be fruitful for optimizing than code at the bottom of the stack. – Mike Dunlavey Feb 15 '17 at 17:33
  • Certainly not arguing against that being the more useful, but one method need perhaps not exclude another. :) – Dolda2000 Feb 15 '17 at 17:52
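For reference, the one-shot gdb batch sample mentioned in these comments could look like this (a sketch; $PID is a placeholder, and each attach still pays the symbol-loading cost):

 # take one "random pause" sample of all threads' stacks, then detach;
 # repeating this at arbitrary moments approximates manual stack sampling
 gdb -batch -p "$PID" -ex 'thread apply all bt' 2>/dev/null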