
I am having some performance issues with my application. I found this answer on Stack Overflow, which I like: https://stackoverflow.com/a/378024/5363

One bit I don't really understand is the relation between code optimization and profiling. Obviously one wants to profile optimized code, but at the same time a lot of information is lost during optimization. So is it practical to run optimized code in a debugger and break into it, as suggested in the quoted answer?

I am using CMake with gcc under Linux, if this makes any difference.

Grzenio
  • Profile first, optimize after. What profiling does is help you identify "hot spots", where you can look into optimization. And the optimization you do is in the actual code you write, so no problem with debuggers, or further profiling. – Some programmer dude Feb 25 '13 at 10:30
  • @JoachimPileborg, but how can I know that the bottleneck is not caused by a library written with the assumption that the code will be optimized? E.g. from my earlier experience, STL behaves horribly in Debug builds. In this particular case the bottleneck seems to be some matrix operations (using the Eigen library). – Grzenio Feb 25 '13 at 10:34
  • @Grzenio Don't profile debug builds! – Peter Wood Feb 25 '13 at 10:40
  • You just need debug info, not to compile a debug build. You should profile with optimisations on. – JasonD Feb 25 '13 at 10:40
  • If applicable, first run your program with `time`. The ratio of user/system/real time should give an indication of whether your program spends most of its time in system calls or waiting for I/O to finish (see the sketch below). – Aki Suihkonen Feb 25 '13 at 12:03
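
A minimal illustration of that `time` suggestion; `./myapp` is a placeholder binary name and the numbers in the comments are made up:

```
# Rough first check: is the program CPU-bound or waiting on something?
time ./myapp                   # ./myapp is a placeholder for your binary
# real    0m12.41s             # wall-clock time
# user    0m12.02s             # CPU time spent in user-space code
# sys     0m00.21s             # CPU time spent in the kernel (system calls)
#
# user + sys close to real  -> CPU-bound; a CPU profiler will help
# real much larger          -> mostly waiting on I/O, locks, or other processes
```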

3 Answers


The general law here is the Pareto principle, the 80/20 rule:

  • 20% of the causes produce 80% of the consequences.

By profiling, you are going to identify the 20% of the most important causes that make your application slow, memory-hungry, or otherwise problematic. And if you fix those 20% of causes, you'll tackle 80% of the slowness, memory consumption, etc.

Of course the figures are only indicative. Just to give you the spirit of it:

  • You have to focus only on the real main causes, and keep improving the optimization until you're satisfied.

Technically, with gcc under Linux, an answer to the question you are referring to ("How can I profile C++ code running in Linux?") suggests a concrete way to do this.

Stephane Rolland

If you need to collect stack samples, why do it through a debugger? Run pstack at regular time intervals instead. You can redirect the output to a different file for each run and analyze those files later. By looking at the call stacks in these files, you may figure out the hot functions. You do not need a debug binary and can do the above on a fully optimized binary.

That said, I would prefer using a profiler tool over doing the above, or over doing what is listed in the thread that you refer to. Profilers quickly pinpoint the top hot functions, and you can understand the call stacks by looking at the caller/callee graph. I would spend time understanding the caller/callee graph rather than analyzing random stacks using the above method.
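
A minimal sketch of the sampling approach described above, assuming the process is already running; `myapp` is a placeholder for its name:

```
# Take 10 stack samples, one per second, each into its own file.
pid=$(pidof myapp)                 # myapp is a placeholder process name
for i in $(seq 1 10); do
    pstack "$pid" > "stack_$i.txt"
    sleep 1
done
# Functions that show up in many of the samples are where the time goes.
```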


As Schumi said, you can use something like pstack to get stack samples. However, what you really need to know is why the program was spending that instant of time when the sample was taken. Maybe you can figure that out from a stack of function names alone. It's better if you can also see the lines of code where the calls occurred. It's better still if you can see the argument values and data context. The reason is that, contrary to the popular notion that you are looking for "hot spots", "slow methods", or "bottlenecks" (i.e. a measurement-based perspective), the most valuable thing to look for is work being done that could be eliminated.

In other words, when you halt the program in the debugger, consider whatever it is doing as if it were a bug. Try to find a way not to do that thing. However, resist doing this until you take another sample and see it doing the same thing - however you describe that thing. Now you know it's taking significant time. How much time? It doesn't matter - you'll find out after you fix it. You do know that it's a lot. The fewer samples you had to take before seeing it twice, the bigger it is.
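
As one rough sketch of taking such a sample without sitting in an interactive debugger session, you can attach gdb in batch mode to a running, optimized-with-symbols binary; `myapp` is a placeholder name here:

```
# Attach to the running process, dump all thread stacks, detach.
# myapp is a placeholder; you need permission to attach (ptrace).
gdb -batch -ex "thread apply all bt" -p "$(pidof myapp)" > sample_1.txt
# Repeat a handful of times; anything you see twice is worth questioning.
```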

Then there's a "magnification effect". After you fix that "speed bug" the program will take a lot less time - but - that wasn't the only one. There are others, and now they take a larger fraction of the time. So do it all again. By the time you finish this, if the program is any bigger than a toy, you could be amazed at how much faster it is. Here's a 43x speedup. Here's a 730x speedup. Here's the dreary math behind it.

You see, the problem with tools is you're paying a price for that ease of sampling. Since you're thinking of it as measurement, you're not concentrating on the reasons why the code is doing what it's doing - dubious reasons. That causes you to miss opportunities to make the code faster, causing you to miss the magnification effect, causing you to stop far short of your ultimate possible speedup.

EDIT: Apologies for the flame. Now to answer your question - I do not turn on compiler optimization until the very end, because it can mask bigger problems. Then I try to do a build that has optimization turned on, but still has symbolic information so the debugger can get a reasonable stack trace and examine variables. When I've hit diminishing speedup returns, I can see how much difference the optimizer made just by measuring overall time - don't need a profiler for that.
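
For the gcc/CMake setup mentioned in the question, a minimal sketch of such a build, optimized but with symbols (the exact flags and names are project-specific):

```
# Out-of-source CMake build with optimization plus debug info.
# RelWithDebInfo passes -O2 -g -DNDEBUG to gcc by default.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make

# Equivalent idea when invoking the compiler directly:
g++ -O2 -g -o myapp main.cpp       # myapp/main.cpp are placeholders
```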

Mike Dunlavey
  • Thanks for your answer! How can I make gcc keep the symbolic information in the optimized build? In the debug build I found that Eigen (linear algebra package) seems to take forever multiplying small vectors and matrices ([2x1]*[2x2][1x2]), but that should be almost immediate. That is why I want to see if optimized build helps. – Grzenio Feb 26 '13 at 13:42
  • @Grzenio: 1) Sorry, I seldom do that with gcc and its linker, but if you spend a couple hours surfing its options, it should be possible. 2) For multiplying matrices, especially if they are small, I would test it by doing a jillion of them and grabbing stackshots. I've been surprised by what's going on in LAPACK (mostly checking arguments). I bet any optimization in there implicitly assumes big matrices. For small matrices, you're almost certain to run a lot faster with hand-coded routines. Any library exists to make it easy for you to write, and assumes you can tolerate some slowness. – Mike Dunlavey Feb 26 '13 at 19:26
  • @Grzenio: For example, I'm not too proud to write routines like `void MxM2(double a[], double b[], double c[])` meaning C = A*B where A, B, and C are 2x2 matrices, and just unrolling all the code in there. That's hard for a compiler to beat, and you can even make it a macro or inline it. – Mike Dunlavey Feb 26 '13 at 19:33
  • @Grzenio: Sorry to keep going. When you say that multiplication "seems to take forever" and "should be almost immediate", I assume you're aware that no matter how speedy you make it, it will still take 100% of whatever time it takes. When you've reduced the execution time from a minute to a millisecond and can't go any further, if the program is still spending 99% of its time multiplying matrices, that's OK. You know you've removed all the waste. – Mike Dunlavey Feb 26 '13 at 19:42