What's the goal?
1. Just getting time measurements you can put on a PowerPoint slide? Or...
2. Finding out how to make the whole thing take less time? (Other than just running it on a faster chip.)
If the goal is (2), then the thing to do is find activities within the software that a) account for a large percentage of wall-clock time, and b) aren't strictly necessary.
The reason is that if you can get rid of an activity taking fraction X of the time (say 50%), the speedup factor you get is up to 1/(1-X), which for 50% is 2x.
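To make the 1/(1-X) arithmetic concrete, here is a minimal sketch. The function name `max_speedup` is just an illustrative choice, and it assumes the activity is removed entirely with nothing else changing.

```python
def max_speedup(fraction_removed: float) -> float:
    """Upper bound on the speedup factor when an activity that takes
    `fraction_removed` of wall-clock time is eliminated completely.

    Hypothetical helper for illustration; it assumes the rest of the
    program takes exactly as long as before.
    """
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


if __name__ == "__main__":
    # Try a small, a medium, and a large activity.
    for x in (0.10, 0.50, 0.90):
        print(f"remove {x:.0%} of the time -> up to {max_speedup(x):.2f}x")
```

Note how nonlinear it is: the payoff climbs steeply as X gets large, which is why the big activities matter so much.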
I'm being careful to use the word "activity" here, because it's a very general concept.
If you think you're only looking for "slow routines", you're going to miss big speedup opportunities, and that's something you can't afford if you actually care about performance.
The key point is that speedup opportunities are like rocks. They come in multiples, and in a range of sizes. If you don't remove every one of them, you're going to be living with the ones you didn't get.
For example, if there are three of them, and removing them saves 50%, 25%, and 12.5% of the original run time, then removing all three leaves only 12.5% of the time, which is a speedup of 8x. Pretty good.
But, if you miss a single one of them, you don't get anywhere near that.
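Here is a rough sketch of what missing one rock costs, using the 50% / 25% / 12.5% figures from the example. The names `ROCKS` and `speedup_after_removing` are made up for illustration, and the sketch assumes the savings simply add up as fractions of the original run time.

```python
# Fractions of the original run time saved by removing each "rock"
# (the three figures from the example above).
ROCKS = (0.50, 0.25, 0.125)


def speedup_after_removing(fractions) -> float:
    """Speedup factor after removing activities that together took
    sum(fractions) of the original run time (assumes savings are additive)."""
    remaining = 1.0 - sum(fractions)
    if remaining <= 0.0:
        raise ValueError("cannot remove 100% or more of the run time")
    return 1.0 / remaining


if __name__ == "__main__":
    print(f"remove all three: {speedup_after_removing(ROCKS):.2f}x")
    for i, missed in enumerate(ROCKS):
        # Remove every rock except the one at position i.
        removed = ROCKS[:i] + ROCKS[i + 1:]
        print(f"miss the {missed:.1%} rock: {speedup_after_removing(removed):.2f}x")
```

Missing the 50% rock leaves you at 1.6x, missing the 25% one at about 2.67x, and even missing the smallest one leaves you at 4x, half of what was available.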
Profilers are supposed to be rock-finders, but if they miss one, how are you going to know?
If the output of the profiler is impressive-looking, but doesn't seem to suggest much you could actually fix, does that mean there's nothing left to fix?
Nope.
More on all that.