
I have a native C++ application that performs heavy calculations and consumes a lot of memory. My goal is to optimize it, mainly to reduce its run time.
After several cycles of profiling and optimizing, I tried Profile-Guided Optimization (PGO), which I had never used before.

I followed the steps described in the MSDN article Profile-Guided Optimizations, changing the compilation (/GL) and linking (/LTCG) flags. After adding /GENPROFILE, I ran the application to create the .pgc and .pgd files, then changed the linker option to /USEPROFILE and saw additional linker messages reporting that the profiling data had been used:

3>  0 of 0 ( 0.0%) original invalid call sites were matched.
3>  0 new call sites were added.
3>  116 of 27096 (  0.43%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
3>  63583 of 345025 inline instances were from dead/cold paths
3>  27096 of 27096 functions (100.0%) were optimized using profile data
3>  608324578581 of 608324578581 instructions (100.0%) were optimized using profile data
3>  Finished generating code
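
For reference, the overall sequence corresponds roughly to the following command-line steps (file and project names here are placeholders; the real build is configured through the Visual Studio project settings):

    rem Sketch of the PGO workflow I followed ("myapp" stands in for the real project).
    rem 1) Compile with whole-program optimization and link an instrumented build:
    cl /c /O2 /GL myapp.cpp
    link /LTCG /GENPROFILE myapp.obj /OUT:myapp.exe

    rem 2) Training run: writes myapp!1.pgc alongside the myapp.pgd created at link time.
    myapp.exe typical_input_scenario

    rem 3) Re-link using the collected profile data.
    link /LTCG /USEPROFILE myapp.obj /OUT:myapp.exe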

Everything looked promising, until I measured the program's performance.

The results were absolutely counterintuitive to me:

  • Performance went down instead of up! The program ran 4% to 5% slower than without Profile-Guided Optimization (comparing builds with and without the /USEPROFILE linker option).

  • Even when running the exact same scenario that was used with /GENPROFILE to create the profiling data files, it ran 4% slower.

What is going on?

Amir Gonnen
  • It would be really nice if you could post a [mcve], for example of a function and data set that got slower after using PGO. – Cody Gray - on strike May 28 '17 at 10:40
  • @CodyGray I know. Unfortunately - I cannot. The problem happens on a large project (see the number of functions and instructions?) which is the property of the company I work for. Even if I could somehow isolate a smaller part of the program that shows the problem, I don't think I'm allowed legally to post it. – Amir Gonnen May 28 '17 at 12:11
  • Amir, "I ran the application to create pgc" - Did you run it not the same input as in timing run? How flat is your current program profile (from visual studio profiler), is there any function with large self time percents (more than 20%, than 5%)? Is there lot of external I/O (disk, network, swap) in the program? – osgx May 31 '17 at 18:33
  • @osgx When running with the exact same input as the timing run, it ran 4% slower (as I mentioned in the question). The program is deterministic (gives the exact same output for a given input) and most time is spent on graph traversal and large bitset operations. The function with highest exlusive samples is 7% (3 functions above 5%, 11 functions above 2%). The program performs very little IO, but consumes ~2Gb ram. The machine has enough ram, no swapping. – Amir Gonnen Jun 01 '17 at 07:22
  • Good link, I learn from it. Thanks, Amir. My result is 10% slower. :( – cppBeginner Feb 25 '19 at 08:53

2 Answers


Looking at the sparse documentation here, profile-guided optimization doesn't seem to include any memory optimizations.

If your program uses 2 GiB of memory, I'd speculate that execution speed is limited by memory access rather than by the CPU itself. (You also mentioned that maps are used; those are memory-bound as well.) Memory access is hard for a profile-guided optimizer to improve, because it cannot change your malloc calls to, for example, put frequently used data onto the same pages or make sure it lands in the same cache lines.
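
To illustrate the point with a made-up example (none of this is taken from your code): the same data traversed from a node-based std::map versus a contiguous std::vector behaves very differently in memory, and that is a data-layout decision no compiler switch, PGO included, can make for you.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <numeric>
    #include <vector>

    // Node-based map: every element is a separate heap allocation, so the
    // traversal chases pointers across many pages and cache lines.
    std::uint64_t sum_map(const std::map<int, std::uint64_t>& m) {
        std::uint64_t total = 0;
        for (const auto& kv : m) total += kv.second;   // pointer-chasing walk
        return total;
    }

    // Contiguous vector: the same values packed together, scanned
    // sequentially, which is friendly to the caches and the prefetcher.
    std::uint64_t sum_vec(const std::vector<std::uint64_t>& v) {
        return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
    }

    int main() {
        std::map<int, std::uint64_t> m;
        std::vector<std::uint64_t> v;
        for (int i = 0; i < 1000000; ++i) { m[i] = i; v.push_back(i); }
        // Same result, very different memory-access pattern.
        std::cout << sum_map(m) << ' ' << sum_vec(v) << '\n';
    }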

On top of that, the optimizer may introduce additional memory accesses while trying to improve the raw CPU performance of your program. The documentation mentions "virtual call speculation"; I would speculate that this (and perhaps other features such as inlining) could add memory traffic, degrading overall performance because memory bandwidth is already the limiting factor.
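
Roughly, virtual call speculation replaces an indirect call with a type guard plus a direct call on the hot path. Below is a hand-written illustration of that idea (the real compiler compares vtable pointers rather than using dynamic_cast); the extra guard and the fallback path are the kind of added code and memory traffic meant above.

    #include <iostream>

    struct Shape {
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };

    // What the source code says: a plain virtual (indirect) call.
    double area_plain(const Shape& s) { return s.area(); }

    // What the optimizer may effectively emit when the profile says the target
    // is almost always Circle: guard on the type, call directly (inlinable) on
    // the hot path, and keep the original indirect call as the fallback.
    double area_speculated(const Shape& s) {
        if (const auto* c = dynamic_cast<const Circle*>(&s)) {
            return 3.141592653589793 * c->r * c->r;   // speculated direct path
        }
        return s.area();                              // fallback indirect call
    }

    int main() {
        Circle c{2.0};
        std::cout << area_plain(c) << ' ' << area_speculated(c) << '\n';
    }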

Johannes

Don't look at it as a black box. If the program can be sped up, it's because it is doing things it doesn't need to do.

Those things will hide from the profile-guided or any other optimizer, and they will certainly hide from your guesser.

They won't hide from random pausing. Many people use it.

I'm trying to resist the temptation to guess, but I'm failing. Here's what I see in big C++ apps, no matter how well written they are. Where people could use a simple data structure like an array, they instead use an abstract container class, with iterators and whatnot. Where does the time go? That's where it goes.

Another thing they do is write "powerful functions and methods". The writer of the function is so proud of it, and it does so much, that he or she expects it to be called reverently and sparingly.
The user of the function (who could be the same person) thinks, "Look how useful this function is! See how much I can get done in a single line of code? The more I use it, the more productive I will be."
See how this can easily lead to needless work?
There's another thing that happens in software: layers of abstraction. If the pattern above is repeated across several layers of abstraction, the slowdown factors multiply.
The good news is that if you can find those things and fix them, you can get an enormous speedup. The bad news is that you could be branded "not a team player".
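
To make that concrete, here is a contrived sketch (my own, not from any particular codebase) of the pattern: a general-purpose helper that copies, re-parses, and allocates on every call, hidden behind a one-line call site in a higher layer.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // The "powerful" helper: maximally general, so it copies its input and
    // builds a fresh vector of strings on every single call.
    std::vector<std::string> split_words(std::string text) {       // copies the whole string
        std::vector<std::string> words;
        std::string cur;
        for (char c : text) {
            if (c == ' ') { if (!cur.empty()) words.push_back(cur); cur.clear(); }
            else          { cur += c; }
        }
        if (!cur.empty()) words.push_back(cur);
        return words;                                               // new allocations per call
    }

    // The next layer up: one innocent-looking line that rebuilds the entire
    // token vector just to answer yes/no.
    bool contains_word(const std::string& line, const std::string& w) {
        for (const auto& t : split_words(line))
            if (t == w) return true;
        return false;
    }

    // And the layer above that calls it in a loop, so the needless work multiplies.
    std::size_t count_lines_with(const std::vector<std::string>& lines,
                                 const std::string& w) {
        std::size_t n = 0;
        for (const auto& line : lines) n += contains_word(line, w);
        return n;
    }

    int main() {
        std::vector<std::string> lines{"fast path", "slow path", "fast and slow"};
        std::cout << count_lines_with(lines, "slow") << '\n';       // prints 2
    }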

Mike Dunlavey
  • Your answer does not address my question: what could cause Visual Studio C++ profile-guided optimization to give such bad results? In my case, the application was written by a small team under my supervision, so I have a pretty good idea of what data structures and algorithms are used. As I mentioned, I did several iterations of profile-optimize which were beneficial (replaced some std::map with std::unordered_map, for example), and I also identified hot spots where most time is wasted, etc. But now, after doing this, I'm looking for ways to improve it even further. – Amir Gonnen May 25 '17 at 18:39
  • @AmirGonnen: I'm trying to say if you want more performance, the method I would recommend is random pausing, not your profile-guided optimizer. [*This post*](https://softwareengineering.stackexchange.com/a/302345/2429) might explain further what I mean. – Mike Dunlavey May 25 '17 at 21:04
  • What is the advantage of random pausing vs. a traditional profiler? Visual Studio's built-in profiler can show the lines/functions that the application spends most of its time in very conveniently. – Amir Gonnen May 26 '17 at 08:37
  • @AmirGonnen: But that doesn't tell you *why* it's there, and if you don't know *why*, you don't know if you can get rid of it. Suppose a line of code is somewhere on the stack 30% of the time, so if you could remove it you would save 30%. How do you know if you can remove it? By examining the whole stack sample and seeing why it's being called. And 1000 stack samples don't tell you any more than 10 do. Less, in fact, because when you have 1000, you don't examine any. Check this [*short video*](https://www.youtube.com/watch?v=xPg3sRpdW1U). – Mike Dunlavey May 26 '17 at 13:41
  • Mike, PGO is not about random stack sampling. It is about telling the compiler what to optimize; compiler and computer time is much cheaper than people's time, and it is not always constructive to promote your own (archaic) method to every one of 4 million people and ask them to keep doing random stack sampling instead of using modern profiling and PGO. Modern profilers are not the gprof of your day: they have low overhead and still sample call-stack information, aggregated and computed to almost-exact percentages, not your 50-99% chances or 10 stacks which works for short – osgx May 31 '17 at 18:29
  • @osgx: [*This link*](https://stackoverflow.com/a/927773/23771) shows a factor 40+ speedup. [*This link*](https://softwareengineering.stackexchange.com/a/302345/2429) shows a factor 700+ speedup. *Have you seen anybody accomplish anything like that using whatever they consider their favorite profiler?* And PGO can only tweak things at the bottom of the call stack, when in most real applications all the slowness is due to badness higher up the stack. PGO cannot read minds. People look for the latest technology to fix their mistaken design decisions, and it can't. – Mike Dunlavey May 31 '17 at 23:59
  • Mike, has anybody written a program so slow that it can be optimized 1000x in 10 steps, where each step is: take a random stack trace, rewrite part of the code? Yes, probably someone may write such a program. Is a stack trace useful? Yes. Should it be taken once (or five times) with gdb, or ten thousand times with perf? Should we forget the old gprof? Yes. Should we never learn modern profilers and always use gdb? No. Should you always and everywhere promote your own method, which needs tens of minutes of reading your long, highly cross-linked posts? Where does helping people end and spam begin? I think learning perf – osgx Jun 01 '17 at 00:05
  • Learning perf and modern tools is more useful than learning an ancient manual stack-tracing technique. Have you checked any profiler newer than gprof? How can you get low-level information from the CPU's hundreds of hardware performance counters with gdb alone? perf gives you access to both stacks and hardware counters, in both counting and sampling mode. And there are simple-to-use, fast-to-learn tools for viewing perf profiles, like https://github.com/jrfonseca/gprof2dot and http://www.brendangregg.com/flamegraphs.html. The perf profiler can also do kernel and off-CPU profiling. – osgx Jun 01 '17 at 00:09
  • @osgx: I'm only trying to be helpful. Yes, factors like that are quite reasonable. Just recently, for example, I analyzed a fairly large app. Every one of the pauses landed deep in container classes. Why? Because it was fetching numbers it had saved in there under names. Why? Because it was trying to be general. Why was it getting the numbers? Because it was building a model *and doing it the same exact way thousands of times*. This case calls out for code generation, and if possible run-time compilation. – Mike Dunlavey Jun 01 '17 at 00:12
  • @Mike, so not all of the tasks you check for performance are problems from 1993? Also, in "They won't hide from this", is the word "this" used to mean brain and thought? – osgx Jun 01 '17 at 00:15
  • @osgx: I'm sure (or hope) *perf* is able to collect stack samples on wall-clock time. (Wall-clock time means you don't need to distinguish on/off-CPU time.) Once you have the samples, then what? Every profiler I've seen makes various kinds of summaries - call graphs, hot paths, flame graphs, and so on - and it's [*easy for speedups to hide from all of those*](https://stackoverflow.com/a/25870103/23771). And the problems that are hiding are the ones that limit speedup. But if you can simply study some of the stack samples, you trade precision of timing for precision of insight, and you find them. – Mike Dunlavey Jun 01 '17 at 00:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/145588/discussion-between-osgx-and-mike-dunlavey). – osgx Jun 01 '17 at 00:21
  • @osgx: The large app I analyzed is currently running with many users, currently being maintained by a staff. And if a problem is hiding from your brain and thought, how would anyone know? That's a "false negative" - where you say nothing's there, and it is. – Mike Dunlavey Jun 01 '17 at 00:23