
I am doing a little proof of concept profile and optimize type of example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear this up.

I wrote a very short snippet of code:

#include <stdio.h>

/* a, b and c are globals shared by both callbacks
   (confirmed in the comments below). */
int a, b, c;

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;

        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}

void callbackWasterOne(void)
{
    a = b * c;
}

void callbackWasterTwo(void)
{
    b = a * c;
}

All it does is call two very basic functions that just multiply numbers together. Since the two functions are identical, I expect the profiler (OProfile) to report roughly the same numbers for both.

I ran this code 10 times per profile, and got the following values for how much time is spent in each function:

  • main: average = 5.60%, stdev = 0.10%
  • callbackWasterOne: average = 43.78%, stdev = 1.04%
  • callbackWasterTwo: average = 50.24%, stdev = 0.98%
  • rest is in miscellaneous things like printf and no-vmlinux

Given that callbackWasterOne and callbackWasterTwo contain the same code, the difference between their times looked significant (to me at least), so I switched their order in the code and reran the profiler, with these results:

  • main: average = 5.45%, stdev = 0.40%
  • callbackWasterOne: average = 50.69%, stdev = 0.49%
  • callbackWasterTwo: average = 43.54%, stdev = 0.18%
  • rest is in miscellaneous things like printf and no-vmlinux

So evidently the profiler samples one function more than the other based purely on execution order. Not good. Disregarding this, I decided to see the effect of removing some code, and got the following average execution times:

  • Nothing removed: 0.5295s
  • call to callbackWasterOne() removed from for loop: 0.2075s
  • call to callbackWasterTwo() removed from for loop: 0.2042s
  • remove both calls from for loop: 0.1903s
  • remove both calls and the for loop: 0.0025s
  • remove contents of callbackWasterOne: 0.379s
  • remove contents of callbackWasterTwo: 0.378s
  • remove contents of both: 0.382s

So here is what I'm having trouble understanding:

  • When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is greater than the time attributed to that one function plus main combined (see the worked arithmetic after this list). How is this possible?
  • Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the loop that performs the calls), why does removing one of the calls produce such a large improvement in the first place?
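
To make the first bullet concrete, here is the arithmetic behind my confusion, using the averages above:

    total runtime (nothing removed)              = 0.5295 s
    callbackWasterOne, per the profile           = 43.78% of 0.5295 ≈ 0.232 s
    main, per the profile                        =  5.60% of 0.5295 ≈ 0.030 s
    lowest runtime I would expect after removal  ≈ 0.5295 - 0.232 - 0.030 ≈ 0.268 s
    runtime actually measured                    = 0.2075 s (a saving of 0.322 s)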

I looked at the disassembly, and the two functions are identical. The calls to them are identical too, and removing a call simply deletes that one call instruction.

Other info that might be relevant:

  • I'm using Ubuntu 14.04 LTS.
  • The code is compiled by Eclipse with no optimization (-O0).
  • I time the code by running it in a terminal using `time` (a minimal in-program alternative is sketched after this list).
  • I use OProfile with count = 10000 and 10 repetitions.
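
For reference, here is the kind of in-program timer I could use to cross-check `time`; the clock_gettime wrapper is my own sketch, not part of the original test:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);  /* wall-clock start */

    /* ... the j/i loops calling callbackWasterOne/Two go here ... */

    clock_gettime(CLOCK_MONOTONIC, &end);    /* wall-clock end */
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.4f s\n", elapsed);    /* comparable to `time`'s "real" */
    return 0;
}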

Here are the results from when I do this with -O1 optimization:

  • main: avg = 5.89%, stdev = 0.14%
  • callbackWasterOne: avg = 44.28%, stdev = 2.64%
  • callbackWasterTwo: avg = 49.66%, stdev = 2.54% (stdev much greater than before)
  • Rest is miscellaneous

Results of removing various bits (execution time averages):

  • Nothing removed: 0.522s
  • Remove callbackWasterOne call: 0.149s (71.47% decrease)
  • Remove callbackWasterTwo call: 0.123s (76.45% decrease)
  • Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)

So removing one call now helps much more than before, and removing both still carries an extra benefit (probably because the optimizer recognizes that nothing happens in the loop and removes it). Still, removing one call is far more beneficial than I would have anticipated.
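
As an aside: if I wanted to keep the optimizer from deleting the emptied loop entirely, a standard trick (my addition here, not something the original test used) is an empty asm statement with a memory clobber, which GCC must assume has side effects:

for (int i = 0; i < 100000; i++)
{
    /* Empty inline asm with a "memory" clobber: GCC cannot prove it is
       side-effect free, so the loop survives even at -O1 and above. */
    __asm__ __volatile__("" ::: "memory");
}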

Results with the two functions using different variables: I defined 3 more global variables for callbackWasterTwo() to use instead of reusing the same ones. Now the results are what I would have expected.
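
The change looked roughly like this (the names d, e and f and their initial values are placeholders of mine; the modified code isn't shown in the original post):

/* Three new globals so the two callbacks no longer share any data.
   Names and initial values are hypothetical. */
int d = 1, e = 2, f = 3;

void callbackWasterTwo(void)
{
    e = d * f;  /* no longer reads or writes a, b or c */
}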

  • main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
  • callbackWasterOne: avg = 46.08%, stdev = 0.53%
  • callbackWasterTwo: avg = 42.82%, stdev = 0.62%
  • Rest is miscellaneous

Results of removing various bits (execution time averages):

  • Nothing removed: 0.520s
  • Remove callbackWasterOne call: 0.292s (43.83% decrease)
  • Remove callbackWasterTwo call: 0.291s (44.07% decrease)
  • Remove both calls: 0.065s (87.55% decrease)

So now removing both calls is pretty much equivalent (within stdev) to removing one call plus the other. Since the effect of removing either call is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that the profiler data (46% vs 42%) is still skewed. Perhaps it is an artifact of how it samples (I'm going to vary the counter value next and see what happens).

It appears that the payoff from optimization relates pretty strongly to how much data the code shares. The only way to achieve "exactly" (you know what I mean) the speedup indicated by the profiler is to optimize completely independent code. Anyway, this is all interesting.

I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...

I repeated this with 10 functions (a different formula in each, but each using some combination of 6 different variables, 3 at a time, all multiplication):
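
For illustration, two of the ten looked something like this (the exact formulas and names are my reconstruction, not copied from the original):

/* Hypothetical examples: each callback multiplies two of the six
   shared globals and stores the result in a third. */
void callbackWasterThree(void)
{
    c = a * e;
}

void callbackWasterFour(void)
{
    f = b * d;
}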

These results are disappointing, to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. And no matter which one I remove (a "fast" one or a "slow" one), the resulting speedup is the same ;) So this leaves me wondering: how many people are relying on a profiler that points them at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!

Here is a screenshot; I don't feel like typing it in again: [screenshot: Opcontrol]

My conclusion: I am overall rather disappointed with OProfile. I decided to try Callgrind (Valgrind) from the command line on the same program, and it gave me far more reasonable results; in fact, all functions spent roughly the same amount of time executing. I think Callgrind samples far more than OProfile ever did.

Callgrind still won't explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...

Mewa
  • Btw, *never* benchmark without optimizations. There are people who unconditionally (and understandably) downvote questions that benchmark without optimizations. – Mysticial Dec 02 '14 at 22:07
  • I just wanted to understand what would happen. And if it was optimized I ran a higher chance of just getting confused because of things being optimized away. – Mewa Dec 03 '14 at 01:01
  • @Mysticial, I have added the results when using `-O1`. Hopefully that is more useful. – Mewa Dec 03 '14 at 18:05
  • 1) I assume a, b, and c are global? 2) @Mysticial: I get annoyed when people take it as religion that you should only profile optimized code, 'cause that's what some kiddo told them. There's plenty more to say on that subject. – Mike Dunlavey Dec 05 '14 at 01:25
  • @MikeDunlavey Apologies then. I don't usually say stuff like that without a reason. And the reason here is simply that in my experience, it is almost never meaningful to profile without optimizations. The sole exception I've ever seen on SO is [this](http://stackoverflow.com/questions/21842402/gcc-o0-outperforming-o3-on-matrix-sizes-that-are-powers-of-2-matrix-transposi). But I digress... – Mysticial Dec 05 '14 at 02:01
  • 1
    @Mysticial: Just a couple days ago, my teammates found/fixed a slowness bug. Time went from 144s to 5s. They used the VC profiler, plus a lot of "intuition". I pointed out to them a single pause in the debugger would have nailed it with 97% certainty, but it's hard to run the debugger on optimized code. So it comes down to one's presuppositions about what kinds of speed problems the code might have, small ones or big ones. – Mike Dunlavey Dec 05 '14 at 13:29
  • Hi @MikeDunlavey, a, b, and c are indeed global. My current perception for why I observe such a strange difference between what I expect based on profiler results and what happens when I cut out the function has something to do with caching since the code is so small. I haven't had any luck tracking down the reason using cachegrind/callgrind, and at this point I'm just being stubborn (I think) because I really want to know what causes this. I installed Intel VTune in an attempt to get some insight into the real processor cache for this program, but it doesn't work yet :) – Mewa Dec 05 '14 at 19:27
  • @MikeDunlavey, also I did the hand sampling, but the main idea at the beginning was to show my supervisor the difference (or lack of it) between normal profilers and doing it by hand. – Mewa Dec 05 '14 at 19:30
  • @Mewa: I understand. Please forgive me for being a curmudgeon. I see people looking to precisely measure molecules, never imagining that they might have gobs of blubber they could slice away. Once you do that, then there's *more*, then *more*. We're talking *big* speedups, like [*here*](http://scicomp.stackexchange.com/a/1870/1262). This is common in big programs full of data management, etc. People don't know how to do that. If that's cleaned out, then you can sweat little stuff. – Mike Dunlavey Dec 05 '14 at 20:41

1 Answer


Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since so much of it is boilerplate that even -O1 would eliminate.

If it's really only the call that's missing, then that could explain the timing differences. There's a lot of boilerplate in the stack-manipulation code that -O0 generates (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards, any return value has to be handled and the opposite stack manipulation has to be done). This contributes to the time it takes to call the functions, but is not necessarily attributed to the functions themselves by oprofile, since that code is executed before/after the function is actually called.
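
To make that concrete, here is roughly the shape of what gcc generates at -O0 for one of these callbacks on x86-64 (illustrative and from memory, not actual compiler output):

void callbackWasterOne(void)
{
    /* gcc -O0 typically emits something like:
         push %rbp            ; prologue: save the caller's frame pointer
         mov  %rsp, %rbp      ; set up this function's frame
         mov  b(%rip), %eax   ; load b from memory (no register caching at -O0)
         imul c(%rip), %eax   ; multiply by c
         mov  %eax, a(%rip)   ; store the result to a
         pop  %rbp            ; epilogue
         ret
       Only the three middle instructions are the "real work"; the rest,
       plus the call/ret at the call site, is per-call overhead. */
    a = b * c;
}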

I suspect the reason the second function always seems to take less time is that there's less (or no) stack juggling to be done: the relevant values are already on the stack thanks to the previous function call, and so, as you've seen, only the call itself has to be executed, without any other extra work.

Cameron
  • Hey, thanks for your reply! I redid the same measurements with `-O1` and I'm updating my original post with those results. Some of it (removing both calls) now makes more sense, but some (removing one call) makes even less. :/ The boilerplate comment somewhat makes sense, but even if the stack saving/restoring is attributed to main, wouldn't the execution speed be reduced by at most the amount taken up by the function itself plus main? – Mewa Dec 03 '14 at 17:55
  • With regards to the 2nd function taking less time... I was thinking it was due to caching or the value being on the stack or something like that, but then wouldn't that show up in the assembly? I would think that -O0 code would just stupidly re-save values. That gives me the idea that I should try having different variables for the two functions (will do that now). – Mewa Dec 03 '14 at 18:00
  • @Mewa: You know, I agree with both of your comments. I'm not sure why the timing is the way it is either; I think perhaps my answer may be wrong :-) – Cameron Dec 03 '14 at 20:43