
Most of my limited experience with profiling native code is on a GPU rather than on a CPU, but I see some CPU profiling in my future...

Now, I've just read this blog post:

How profilers lie: The case of gprof and KCacheGrind

about what profilers measure and what they show you, which is likely not what you expect if you're interested in discerning between different call paths and the time spent in them.

My question is: Is this still the case today (5 years later)? That is, do sampling profilers (i.e. those that don't slow execution down terribly) still behave the way gprof used to (or callgrind without --separate-callers=N)? Or do profilers nowadays customarily record the entire call stack when sampling?

einpoklum
  • Your question is a bit confusing because the second example in the linked post involves callgrind, which is absolutely _not_ a sampling profiler: it runs your process under a virtual machine and can record information about every instruction/basic block/function/whatever. In doing so, it slows everything down by an order of magnitude and so introduces its own set of problems and distortions. It is not a sampling profiler, however. – BeeOnRope Apr 16 '18 at 02:40

3 Answers


No, many modern sampling profilers don't exhibit the problem described regarding gprof.

In fact, even when that was written, the specific problem was actually more a quirk of the way gprof uses a mix of instrumentation and sampling and then tries to reconstruct a hypothetical call graph based on limited caller/callee information and combine that with the sampled timing information.

Modern sampling profilers, such as perf, VTune, and various language-specific profilers for languages that don't compile to native code, can capture the full call stack with each sample, which gives accurate times with respect to that issue. Alternatively, you might sample without collecting call stacks (which greatly reduces the sampling cost) and then present the information without any caller/callee breakdown, which would still be accurate.
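As an illustration, here's a minimal sketch (file and function names are invented for the example) of a program with two distinct call paths into one hot function, and the kind of perf invocation that keeps those paths separate:

```cpp
// demo.cpp -- two call paths into the same hot function (hypothetical example).
// Build with frame pointers so the unwinder can walk the stack:
//   g++ -O2 -fno-omit-frame-pointer -g demo.cpp -o demo
// Sample with call stacks:
//   perf record -g ./demo
//   perf report -g     # work() is broken down by caller, not lumped together
#include <cstdint>

volatile uint64_t sink;  // keeps the loops from being optimized away

void work(uint64_t iters) {
    for (uint64_t i = 0; i < iters; ++i) sink = sink + i;
}

void cheap_path()     { work(10'000'000); }
void expensive_path() { work(90'000'000); }

int main() {
    cheap_path();
    expensive_path();
    return 0;
}
```

With call stacks recorded, perf attributes roughly 10% of work()'s samples to the path through cheap_path and 90% to the path through expensive_path; a caller-agnostic view would only tell you that work() dominates, which is exactly the distinction the question cares about.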

This was largely true even in the past, so I think it's fair to say that sampling profilers never, as a group, really exhibited that problem.

Of course, there are still various ways in which profilers can lie. For example, getting results accurate at the instruction level is a very tricky problem, given modern CPUs with hundreds of instructions in flight at once, possibly across many functions, and complex performance models where instructions may have a very different in-context cost compared to their nominal latency and throughput values. Even those tricky issues can be helped by "hardware assist", such as recent x86 chips with PEBS support and later related features, which help you pinpoint an instruction in a less biased way.
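For instance, here's a hedged sketch of using perf's precise-sampling event modifier (PEBS-backed on Intel chips; the workload is invented) on a loop whose cost is concentrated in one dependent load:

```cpp
// skid.cpp -- a loop-carried dependent load; with plain sampling, cycle
// samples may "skid" to instructions near the load rather than the load itself.
//   g++ -O2 -g skid.cpp -o skid
//   perf record -e cycles ./skid      # default sampling: some skid expected
//   perf record -e cycles:pp ./skid   # precise (PEBS-assisted) attribution
//   perf annotate                     # inspect instruction-level attribution
#include <cstddef>
#include <numeric>
#include <vector>

int main() {
    // A long chain of indices; each iteration depends on the previous load.
    std::vector<size_t> next(1 << 22);
    std::iota(next.begin(), next.end(), size_t{1});
    next.back() = 0;  // close the chain into a ring

    size_t i = 0;
    for (long n = 0; n < 100'000'000; ++n)
        i = next[i];  // the load that samples should be pinned to
    return static_cast<int>(i);
}
```

The `:pp` suffix asks perf for a more precise sample-attribution level, which uses PEBS where the hardware supports it.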

BeeOnRope

Regarding gprof, yes, it's still the case today. This is by design, to keep the profiling overhead small. From the up-to-date documentation:

Some of the figures in the call graph are estimates—for example, the children time values and all the time figures in caller and subroutine lines.

There is no direct information about these measurements in the profile data itself. Instead, gprof estimates them by making an assumption about your program that might or might not be true.

The assumption made is that the average time spent in each call to any function foo is not correlated with who called foo. If foo used 5 seconds in all, and 2/5 of the calls to foo came from a, then foo contributes 2 seconds to a’s children time, by assumption.
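To make that assumption concrete, here's a hedged sketch (names invented) of a program where it fails badly: foo's per-call cost is strongly correlated with its caller, so gprof's proportional split misattributes the children time:

```cpp
// gprof_lie.cpp -- foo's cost depends on who calls it, violating gprof's
// assumption. Build instrumented, run, then inspect the call graph:
//   g++ -O0 -pg gprof_lie.cpp -o gprof_lie
//   ./gprof_lie              # writes gmon.out
//   gprof gprof_lie gmon.out
#include <cstdint>

volatile uint64_t sink;

void foo(uint64_t iters) {
    for (uint64_t i = 0; i < iters; ++i) sink = sink + i;
}

// a() makes 4 cheap calls, b() makes 1 expensive one. The real split is
// roughly a: 4%, b: 96%. gprof sees 5 calls, 4/5 of them from a(), and so
// estimates a's children time as ~80% of foo's total -- wildly wrong.
void a() { for (int i = 0; i < 4; ++i) foo(1'000'000); }
void b() { foo(96'000'000); }

int main() { a(); b(); return 0; }
```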

Regarding KCacheGrind, little has changed since the article was written. You can check the change log and see that the latest version was published on April 5, 2013, and the changes it includes are unrelated to this issue. You can also refer to Josef Weidendorfer's comments under the article (Josef is the author of KCacheGrind).
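(The workaround mentioned in the question still applies here: running `valgrind --tool=callgrind --separate-callers=2 ./prog` and then inspecting the output with `callgrind_annotate` or KCacheGrind keeps two levels of callers separate, per the callgrind manual, at the usual order-of-magnitude slowdown.)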

Hadi Brais
  • So, I +1'ed your answer because it, well, answers my question, but this is terrible. – einpoklum Apr 12 '18 at 21:05
  • @einpoklum You may want to check out [other profilers](https://en.wikipedia.org/wiki/List_of_performance_analysis_tools). In particular, CodeXL by AMD and VTune Amplifier XE by Intel. There are also a number of tools from Microsoft. I've used these tools and am mostly happy with them. – Hadi Brais Apr 12 '18 at 21:25
  • Are the examples you cite better in the sense that I describe? Are they sampling profilers? – einpoklum Apr 12 '18 at 21:38
  • @einpoklum Yes, they are. They all support sampling-based profiling, though in some cases instrumentation may be used. I wrote a [series of articles](https://hadibrais.wordpress.com/2017/03/15/the-art-of-profiling-using-intel-vtune-amplifier-part-1/) on VTune and I'm planning to write a similar series on the Visual Studio built-in profiler in the near future. The thing is that gprof and KCacheGrind are really old and not regularly maintained, while the ones I mentioned are still maintained and much more advanced. That's why I use them. – Hadi Brais Apr 12 '18 at 21:45
  • It's worth noting that this particular flaw isn't _inherent_ to sampling profilers, but to samplers like `gprof` that record limited information about the call stack. In the case of `gprof`, you are actually looking at a _hybrid_ sampler that records caller/callee information using instrumentation, but then records the PC periodically with sampling. It combines those into a type of fake call-stack output using the assumptions you highlighted above, which can be arbitrarily wrong. – BeeOnRope Apr 16 '18 at 02:35
  • However, this design isn't standard or even usual in most sampling profilers. A typical sampling profiler might capture the entire call stack on a sample (`perf` does, if you ask it to). This avoids the problem entirely. – BeeOnRope Apr 16 '18 at 02:36

As you may have noticed, I contributed several comments to the post you referenced. But it's not just that profilers give you bad information; it's that people fool themselves about what performance actually is.

What is your goal? Is it to A) find out how to make the program as fast as possible? Or is it to B) measure the time taken by various functions, hoping that will lead to A? (Hint - it doesn't.) Here's a detailed list of the issues.

To illustrate: You could, for example, be calling a teeny innocent-looking little function somewhere that just happens to invoke nine yards of system code, including reading a .dll to extract a string resource in order to internationalize it. This could be taking 50% of wall-clock time and therefore be on the stack 50% of wall-clock time. Would a "CPU-profiler" show it to you? No, because practically all of that 50% is doing I/O.

Do you need many, many stack samples to know to 3 decimal places exactly how much time it's taking? Of course not. If you only got 10 samples it would be on 5 of them, give or take.

Once you know that teeny routine is a big problem, does that mean you're out of luck because somebody else wrote it? What if you knew what the string was that it was looking up? Does it really need to be internationalized, so much so that you're willing to pay a factor of two in slowness just for that? Do you see how useless measurements are when your real problem is to understand qualitatively what takes time?
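To put a number on that "give or take" (a back-of-the-envelope sketch, not tied to any particular profiler): the number of samples that land on a routine follows a binomial distribution, so a handful of samples already locates a big time sink:

```cpp
// samples.cpp -- expected hits and spread when a routine is on the stack
// a fraction p of the time and we take n independent stack samples.
//   g++ samples.cpp -o samples && ./samples
#include <cmath>
#include <cstdio>

int main() {
    const double p = 0.5;                  // routine on the stack 50% of the time
    const int counts[] = {10, 100, 1000};  // sample-set sizes to compare
    for (int n : counts) {
        double mean = n * p;                       // expected number of hits
        double sd   = std::sqrt(n * p * (1 - p));  // binomial standard deviation
        std::printf("n=%4d: expect %6.1f hits +/- %4.1f\n", n, mean, sd);
    }
    return 0;
}
```

Ten samples give 5 ± 1.6 hits: easily enough to know the routine is worth investigating, even though resolving many distinct stack states to a given precision (the concern raised in the comments below) does take many more samples.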

I could go on and on with examples like this...

Mike Dunlavey
  • To answer your counter-question: I want to understand how my code behaves and what kind of work is being done for how much of the time. With your example, a CPU profiler with sufficiently extensive information gathering would show me which system calls (and perhaps which common partial parameter combinations to those calls) take up a lot of time - and more than that, which lines of code make those system calls; and then I might figure out that I'm spending half my time on reading internationalized strings from files. Knowing that, I'll either call it a day, or decide that I want to do... – einpoklum Apr 12 '18 at 21:13
  • ... something about (like moving that work into a separate thread, or caching those strings somehow, etc.) – einpoklum Apr 12 '18 at 21:15
  • And "Do you need many many stack samples to know to 3 decimal places exactly how much time it's taking?" - you're misunderstanding. I need many stack samples to know up to the _first_ decimal points how much time different stack states are taking, since the state space is much larger. – einpoklum Apr 12 '18 at 21:17
  • @einpoklum: Possibly [*this link*](https://scicomp.stackexchange.com/a/2719/1262) will help to explain how it works statistically. – Mike Dunlavey Apr 13 '18 at 11:22