
I want to improve the performance of a specific method inside a larger application.

The goal is improving latency (wall-clock time spent in a specific function), not (necessarily) system load.

Requirements:

  1. As I expect a lot of the latency to be due to I/O, take into account time spent waiting/blocked (in other words: look at wall clock time instead of CPU time)
  2. As the program does much more than the fragment I'm trying to optimize, there needs to be a way to either start/stop profiling programmatically, or to filter the output to show only the time between entering and exiting the function I'm optimizing.
  3. Profiling at the method level is acceptable (instruction level would be even better; profiling only system calls is probably not enough)
  4. This is for a hobby project, so expensive tools are not really an option
  5. Instrumentation (-finstrument-functions) is acceptable
  6. The critical piece of code I'm interested in is hard to interrupt manually (because it is already relatively fast and hard to realistically invoke in a loop), so some kind of automation is necessary.

Tools discarded so far:

  • gprof, oprofile, callgrind (requirement 1)
  • building something custom using getrusage (requirement 1)
  • poormansprofiler.org (requirement 2)
  • strace -T, dtrace, perf (http://perf.wiki.kernel.org) (requirements 2 and 3)
  • VTune, Zoom (requirement 4)
  • manual call-stack sampling (requirement 6)
  • google-perftools (should be able to measure wall time, but this does not appear to work in my case, presumably because of SIGALRM interference)
  • systemtap (my kernel isn't patched to include utrace)

Other options which I haven't further evaluated yet:

  • cprof (doesn't build here out-of-the-box, seems i386-only)
  • manually inserting trace points (e.g. with lttng)

I'd love to hear about:

  • other options
  • perhaps I discarded some tool too soon?
  • whether the options I haven't evaluated yet have a chance of working, and if so, how best to use them.

I finally settled for:

The trace produced by this crude tool is hard to interpret, and I can easily imagine some tools for further processing its output making it infinitely more useful. However, this did the job for me for now, so I'm putting that project off until later ;).

Arnout Engelen

4 Answers


Use this method.

It is quite simple and effective at pinpointing opportunities for optimization, whether they are in CPU or IO bound code.

If you are right that the biggest opportunities are in a particular function or module, then it will find them. If they are elsewhere, it will find them.

Of the tools you mentioned and discarded, it is most similar to poor man's profiler, but still not very similar.

EDIT: Since you say it is triggered by a user interaction and blocks further input until it completes, here's how I would do it.

First, I assume it does not block a manual interrupt signal to the debugger, because otherwise you'd have no way to stop an infinite loop. Second, I would wrap a loop of 10, 100, or 1000 times around the routine in question, so it is doing it long enough to be manually interrupted.

Now, suppose it is spending some fraction of time doing I/O, like 50%. Then when you interrupt it, you have a 50% chance of catching it in the I/O. So if you catch it in the I/O, which the call stack will tell you, you can also see in great detail where the I/O is being requested from and why.

It will show you what's going on, which is almost certainly something surprising. If you see it doing something on as few as two (2) samples that you could find a way to eliminate, then you will get a considerable speedup. In fact, if you eliminate that activity, you don't know in advance how much time you will save, but on average you can expect to save fraction F = (s+1)/(n+2), where n is the total number of samples you took, and s is the number of samples that showed the activity. (Rule of Succession) Example, if you took 4 stack samples and saw the activity on 2 of them, on average it would save you F = 3/6 = 1/2, corresponding to a speedup factor of 1/(1-F) or 2.

Once you've done that, you can do it again and find something else to fix. The speedup factors multiply together like compound interest.

Then of course you remove the outer loop and "cash in" all the speedups you got.

If you are wondering how this differs from profiling, it is that by carefully examining each stack sample, and possibly related data, you can recognize activities that you could remove, where if all you've got is measurements, you are left trying to intuit what is going on. The actual amount of time you save is what it is, regardless of any measurements. The important thing is to find the problem. No matter how precisely a profiler might measure it, if you can't find it, you're not winning. These pages are full of people saying either they don't understand what their profiler is telling them, or it seems to be saying there is nothing to fix, which they are only too willing to accept. That's a case of rose-tinted glasses.

More on all that.

Mike Dunlavey
  • Indeed I came across that page while searching. You describe call-stack sampling. That would be a great solution to my problem. However, the critical piece of code I'm interested in is hard to interrupt manually. Do you know of a tool/technique that would help me do it automatically? – Arnout Engelen Aug 02 '12 at 08:12
  • As for requirement 2, this is not confusing me, I do really want to optimize this method specifically. The method invocation is triggered by a user interaction, and further UI interaction is blocked until it finishes. Other parts of the program are probably 'heavier'/'slower', but the user isn't so actively waiting for those anyway, so that's not much of a concern. – Arnout Engelen Aug 02 '12 at 08:23
  • @ArnoutEngelen: What I do in a case like that is amplify it by wrapping a loop around it, like executing it 10 or 100 times when it would ordinarily be done once. Then I do the optimizations, and finally remove the loop. – Mike Dunlavey Aug 02 '12 at 11:50
  • As I specified, wrapping a loop around the to-be-optimized segment is not easily doable in my case. Even though I feel your answer does not really answer my original question (as it ignores the 'no loop' requirement), it's still good information for other cases where this is feasible, so I'm accepting your answer. – Arnout Engelen Aug 22 '12 at 20:46

For I/O bound applications you can use the --collect-systime=yes option of callgrind.

This collects time spent in system calls (in milliseconds). So if you believe you have an I/O bottleneck, you can use these stats to identify it.

Calmarius

Todo: check out 'perf' (again)

Arnout Engelen
  • Nah, as of the 5.6.2 kernel, perf [is not of much use](https://github.com/Hi-Angel/notes-backup/blob/26593ebf8c0a007159ac211e81a3225cdb9c31fc/perf.md#measuring-time-that-threads-spend-sleeping-locked-etc) for profiling latency. – Hi-Angel Apr 17 '20 at 11:26
  • fork()
  • execxxx(process under test)
  • in the parent:
    • (in a loop) periodically call:
    • getrusage(RUSAGE_CHILDREN, ...)

getrusage() will not only give you the CPU usage, but also the major/minor page faults, number of context switches, etc. The rest of the time is probably spent waiting for I/O. This won't give you detailed profiling information, but a nice overall footprint of the program's behavior, comparable to running vmstat on a per-process basis.

wildplasser
  • It looks like this won't point me at places that do a lot of (possibly avoidable) I/O, and I'm specifically interested in that, right? – Arnout Engelen Aug 02 '12 at 07:57
  • The whole idea was to offer another way of looking at things. If plain profiling does not yield obvious hotspots, there is nothing more to do. (except for algorithmic changes, cache alignment, locality of reference, etc) – wildplasser Aug 02 '12 at 08:52
  • Well, for example I have not yet found a tool that would point me at the 'slow()' method of http://pastebin.com/W7ZS31yB - and neither would getrusage, as far as I can see. Obviously there's ways to catch that (the most promising ones I see so far are instrumentation and stack sampling), but I haven't found free tools that do that yet.. – Arnout Engelen Aug 02 '12 at 08:59
  • BTW: I was wrong, RUSAGE_CHILDREN only reports for children that have exited. If you'd want intermediate values, you'd have to use getrusage from the process itself (probably using a timer-driven signal) and send them to an external collector process (using a pipe or UDP socket) That would be a lot of instrumentation. And you'd still be lacking the current stacktrace at the moment of sampling. (IIRC there is a trick for that) BTW: signal-delivery is more likely when the program is blocked inside a system call, so that would be a win. – wildplasser Aug 02 '12 at 10:08