Optimizing a Loop vs Code Duplication

Question

My dilemma concerns how to best handle long heavy loops which can accept parameters. Consider the following method:

void HeavyLoop(byte* startingAddress, bool secondaryModification)
{
    for (int i = 0; i < 10000000; i++)
    {
        byte* b = startingAddress + i;
        *b+= 1;
        if (secondaryModification) *b+= 2;
    }
}

This method will do what I want, but I am using 10000000 unnecessary ifs inside the loop.

Had I written the same method like this:

void HeavyLoop(byte* startingAddress, bool secondaryModification)
{
    if (secondaryModification)
    {
        for (int i = 0; i < 10000000; i++)
        {
            byte* b = startingAddress + i;
            *b+= 1;
            *b+= 2;
        }
    }
    else
    {
        for (int i = 0; i < 10000000; i++)
        {
            byte* b = startingAddress + i;
            *b+= 1;         
        }
    }   
}

I would get the same result, though my entire loop code would have to be duplicated. This is not a big deal if we're talking about one parameter, but when you have 4 independent parameters I would have to write 16 different versions of the loop.

What is the "correct" solution in cases like this? If this were a language like Python I could just dynamically build a function to handle the loop. Is there something similar in C++?

Needless to say, the code is only an example and not the actual case. Please don't give solutions pertaining to *b+=1 per se. I am a C++ novice so forgive me if there is a simple solution I am not aware of. Also forgive me if there are syntax errors, I don't have a compiler handy at the moment.

Edit: The issue is dealing with statements that can not be pre-calculated outside of the loop.

I think you have to come up with a slightly more complex example, or else simplistic solutions like Kerrek's will be posted. — duedl0r, Jan 10 '12 at 15:21
Have you proven the compiler doesn't do this optimization itself? Hoisting invariants out of loops has been a standard optimization for a long time, and is easy to implement to boot. — , Jan 10 '12 at 15:23
"If this were a language like Python I could just dynamically build a function to handle the loop" - although the overhead of a call via a function pointer is almost certainly higher than the overhead of the `if`, so I very much doubt that you gain performance that way. But maybe I'm underestimating CPython, is it capable of inlining a function call *after* that function has been defined locally with a variable capture, and optimize it based on the value of the variable? — Steve Jessop, Jan 10 '12 at 15:27
Also make sure to only optimize if the loop in question really is the performance bottleneck in your application (see premature optimization, e.g. here: http://c2.com/cgi/wiki?PrematureOptimization) — codeling, Jan 10 '12 at 15:27
In your actual cases, do the `modification` flags mostly add extra statements to the loop body, or do they trigger completely alternative loop bodies? — Frerich Raabe, Jan 10 '12 at 15:28
@delnan: even without hoisting out of the loop, dynamic branch prediction by the CPU ought to do pretty well with this. So although I agree that it's pointless making this optimization if the compiler does it, beware it might well be pointless making this optimization even if the compiler doesn't. — Steve Jessop, Jan 10 '12 at 15:30
If this isn't your actual case, is it at least *representative* of the actual case? That is to say, is the increment amount constant during each function invocation, or do you need a way to look up the increment amount during each round of the loop? Don't oversimplify a question to where it loses its main point! — Kerrek SB, Jan 10 '12 at 15:30
@Steve: CPython won't, it's a dump interpreter. PyPy on the other hand has a tracing JIT compiler. Removing conditional jumps and inlining calls is the *least* tracing JITs do. And the way it is constructed, it doesn't care the slightest bit if the code was written by hand or created dynamically. There's [loop invariant code motion](http://morepypy.blogspot.com/2011/01/loop-invariant-code-motion.html) too, so it should hoist any `guard` that might remain out of the loop regardless of where the variable comes from. — , Jan 10 '12 at 15:40
@duedl0r - I agree, should have come up with a better example. I bolded out the part of my question concerning this. — Rotem, Jan 10 '12 at 15:59
@delnan - No I have not, I was not aware the compiler can do this, I will profile it. — Rotem, Jan 10 '12 at 16:00
@Steve - I meant to dynamically build the entire function including the loop, not just the part inside the loop. — Rotem, Jan 10 '12 at 16:00
@KerrekSB - I mainly wanted to know how to get rid of unnecessary `if` checks in long loops, I think this problem doesn't rely heavily on the actual statements, does it? — Rotem, Jan 10 '12 at 16:03
Rotem: See the comment to my answer -- it's simply unclear from your question what the most general situation is that you need to address. If it's all about working out an increment value from a set of conditions, then my answer should apply; if not, then you'll need *some* sort of branching dispatch. — Kerrek SB, Jan 10 '12 at 16:05
@Kerrek If the value could be inferred outside of the loop this would be a non-issue. In my actual current case one option passes the values through a look-up table, one option multiplies the result by a matrix, and one option zeroes only some on the values, but I really don't understand why it matters if it's a given that the value can not be calculated outside the loop. — Rotem, Jan 10 '12 at 16:13
@Rotem: So, could the question be phrased as how to optimize `for (byte * b ...) { if (condition) { foo(*b); } else { bar(*b); }`? — Kerrek SB, Jan 10 '12 at 16:33
@Kerrek I suppose it could be phrased that way though I was trying to avoid calling other functions from within the loop. But perhaps it would have been easier to understand. — Rotem, Jan 10 '12 at 16:44
Use your second option above, if performance is an issue; otherwise use the first. There are other solutions that are more "elegant" and maybe a few lines shorter, but they obscure the logic of the program, and obscured logic is bug bait. — Hot Licks, Jan 10 '12 at 18:21

Mike Seymour · Accepted Answer · 2012-01-10T16:35:53.977

17

You could implement the loop as a template; the template argument is a compile-time constant, so optimisation ought to remove the unwanted code when it's false. You then need a wrapper to allow the correct specialisation to be called based on a run-time value:

template <bool secondaryModification>
void HeavyLoop(byte* startingAddress)
{
    for (int i = 0; i < 10000000; i++)
    {
        byte* b = startingAddress + i;
        *b+= 1;
        if (secondaryModification) *b+= 2;
    }
}

void HeavyLoop(byte* startingAddress, bool secondaryModification)
{
    if (secondaryModification) {
        HeavyLoop<true>(startingAddress);
    } else {
        HeavyLoop<false>(startingAddress);
    }
}

During compilation, both versions of the template will be instantiated (one containing *b+=2; and one not, and neither performing a runtime test on the argument); they should then be inlined in the wrapper function to generate exactly the same code as your second example - but without the need to duplicate any source code.

edited Jan 10 '12 at 16:35

answered Jan 10 '12 at 15:21

Mike Seymour

249,747
28
448
644

This seems interesting, I am unfamiliar with template functions in C++. I am not sure what you mean by 'the template argument is a compile-time constant'. Does the compiler statically create two separate functions (one for each possible value of `secondaryModification`), or is the function created at runtime before being called via `HeavyLoop(startingAddress)`? – Rotem Jan 10 '12 at 16:27
1

@Rotem: Two functions are generated at compile time; in general, template specialisations are always generated at compile time if they are needed. – Mike Seymour Jan 10 '12 at 16:32
Oh I see, so had you not been calling `HeavyLoop(startingAddress)` anywhere in the code, the specialization would not have been generated statically by the compiler? – Rotem Jan 10 '12 at 16:35
@Rotem: That's right. A template specialisation is only instantiated if it's needed, or if you explicitly instantiate it (i.e. `template void HeavyLoop(byte*);`). – Mike Seymour Jan 10 '12 at 16:43
@MikeSeymour now look what you've done: http://stackoverflow.com/questions/19221646/c-convert-variables-to-template-arguments – doctorlove Oct 07 '13 at 10:15

Matthieu M. · Answer 2 · 2012-01-10T18:11:36.323

12

EDIT: to better target what the OP desired and still remove the cruft, this post has been thoroughly edited.

Of course, I will assume that you have profiled this and this was shown to be a hot spot... right ?

In fact, I bet you did not. And that you are seriously underestimating your compiler.

For example, here is your code compiled with LLVM:

void f1(char*);
void f2(char*);

void loop(char* c, int n, int sm) {
  for (int i = 0; i < n; ++i) {
    if (sm) f1(c);
    else f2(c);
  }
}

Which yields:

define void @loop(i8* %c, i32 %n, i32 %sm) nounwind uwtable {
  %1 = icmp sgt i32 %n, 0
  br i1 %1, label %.lr.ph, label %._crit_edge

.lr.ph:                                           ; preds = %0
  %2 = icmp eq i32 %sm, 0
  br i1 %2, label %3, label %5

; <label>:3                                       ; preds = %3, %.lr.ph
  %i.01.us = phi i32 [ %4, %3 ], [ 0, %.lr.ph ]
  tail call void @f2(i8* %c) nounwind
  %4 = add nsw i32 %i.01.us, 1
  %exitcond = icmp eq i32 %4, %n
  br i1 %exitcond, label %._crit_edge, label %3

; <label>:5                                       ; preds = %5, %.lr.ph
  %i.01 = phi i32 [ %6, %5 ], [ 0, %.lr.ph ]
  tail call void @f1(i8* %c) nounwind
  %6 = add nsw i32 %i.01, 1
  %exitcond2 = icmp eq i32 %6, %n
  br i1 %exitcond2, label %._crit_edge, label %5

._crit_edge:                                      ; preds = %5, %3, %0
  ret void
}

Even if you don't know the LLVM IR, just follow the "sm" variable:

.lr.ph:                                           ; preds = %0
  %2 = icmp eq i32 %sm, 0
  br i1 %2, label %3, label %5

The compiler has generated two different loops (beginning at <label>:3 and <label>:5 respectively, and choose once and for all the loop to execute at the start of the fonction.

This is a fairly known compiler trick: Loop Invariant Code Motion (and derivated), so why bother doing it manually ? If it's worth it, the compiler will do it!

edited Jan 10 '12 at 18:11

answered Jan 10 '12 at 16:10

Matthieu M.

287,565
48
449
722

1

+1 for showing me something cool I didn't know about, though my question is how to deal with values that can not be predetermined outside the loop. I apologize if it was not clear from the question. – Rotem Jan 10 '12 at 16:31
@Rotem: I completely changed the answer to better exhibit the behavior you were looking for (using pure declarations so the compiler does not see through the various operations). See how it simply duplicated the loop code, specialized either for `sm == true` (label 5) or `sm == false` (label 3) and hoisted the comparison out the loop :) ? – Matthieu M. Jan 10 '12 at 18:13
That is extremely interesting and validates the claim made by others in this question. Thanks for taking the time. – Rotem Jan 10 '12 at 18:29
Too bad I can't give more than +1 - the only reasonable answer here. Why obfuscate the code with templates to do something any halfway decent compiler will do anyhow? (and if it doesn't you have much bigger performance problems than this anyhow) The best you can do with the manual optimization is avoid ONE branch (at least I really don't see how any of the given answers with templates, etc. does anything more than hoisting the invariant out of the loop) – Voo Jan 10 '12 at 18:51
2

@Matthieu: I tested this method (no manual optimization) vs. the template function Mike suggested. The test function takes 5 boolean parameters. When all parameters are passed as true, or in other words, when no code branches need to be discarded, there is no difference in performance. However, when they are all false, I do get a ~35% difference in performance. It is absolutely possible though, that I'm just not setting up some compiler switch properly due to lack of experience. – Rotem Jan 11 '12 at 09:43
@Rotem: it is possible that if there are many different switches then the compiler will not hoist all invariants out of the loop. In this case you can try and use the template solution. – Matthieu M. Jan 11 '12 at 14:16

Paul R · Answer 3 · 2012-01-10T15:33:01.327

10

One technique for this kind of problem which works in both C and C++ is to use an inline function. For C++ only you can just use a template function (effectively the same solution, but slightly more elegant).

Here's the inline solution for C/C++:

inline void HeavyLoop_inline(byte* startingAddress, bool secondaryModification)
{
    for (int i = 0; i < 10000000; i++)
    {
        byte* b = startingAddress+ i;
        *b+= 1;
        if (secondaryModification) *b+= 2;
    }
}

void HeavyLoop(byte* startingAddress, bool secondaryModification)
{
    if (secondaryModification)
    {
        HeavyLoop_inline(startingAddress, TRUE);
    }
    else
    {
        HeavyLoop_inline(startingAddress, FALSE);
    }
}

The reason this works (and is efficient) is that the value of secondaryModification that is passed to the inline function is a compile-time constant, so the compiler is able to optimise away any dead code for each invocation. This then gives you two "specialized" versions of the function.

Notes

Depending on the compiler you are using you may want to take additional steps to ensure that the inline function actually is inlined. E.g. for gcc you may add __attribute__ ((always_inline)).

Note also that some compilers will perform this kind of loop re-factoring optimisation without any intervention, so check your generated code first before trying to outsmart the compiler.

edited Jan 10 '12 at 15:33

answered Jan 10 '12 at 15:16

Paul R

208,748
37
389
560

and how would this improve performance? It only adds an additional method call, but keeps the if condition inside the loop... or does every compiler optimize the condition away?` – codeling Jan 10 '12 at 15:18
it will in-line the code with the same problem. the "if" is still there always. isn't it ? – Roee Gavirel Jan 10 '12 at 15:19
@Pubby: why is this a "terrible solution" ? – Paul R Jan 10 '12 at 15:22
3

+1 This should have the desired effect as long as the function is actually inlined; the flag is a compile-time constant, so the optimiser should be able to remove the dead code. I'd probably use a template parameter, to be sure the compiler treats it as a compile time constant (as I did in my answer, before I read this one). – Mike Seymour Jan 10 '12 at 15:23
2

+ The compiler should be able to tell that the argument is known, so it can either include or exclude the `*b += 2` at compile time. – Mike Dunlavey Jan 10 '12 at 15:23
3

+1, those who don't understand this should probably compile and disassemble it with suitable optimization. Inlining allows the optimizer to statically predict the branch `if (secondaryModification)`, so in the FALSE case it will be eliminated entirely, and in the true case the test of the condition will be eliminated and the `*b+=2` left in. And then hopefully combined with the `*b+=1`. – Steve Jessop Jan 10 '12 at 15:23
1

+1 for your detailed answer. As it's C++ though, I will go with the template functions as it seems more elegant as you said. – Rotem Jan 10 '12 at 16:38

Kerrek SB · Answer 4 · 2012-01-10T15:31:24.863

1

Why not like this:

void HeavyLoop(byte * startingAddress, bool secondaryModification)
{
    int const incr = secondaryModification ? 3 : 1;

    for (byte * b = startingAddress, * const end = startingAddress + 10000000;
         b != end; ++b)
    {
        *b += incr;
    }
}

You can of course put anything you like in the definiton of incr.

The insane might even write *(b++) += incr into the loop incrementor. A better way for lovers of arcane C syntax would be this:

byte * b = startingAddress, * const end = startingAddress + 10000000;
while (b != end) { *(b++) += incr; }

edited Jan 10 '12 at 15:31

answered Jan 10 '12 at 15:15

Kerrek SB

464,522
92
875
1,084

1

I think that the OP stated "Needless to say, the code is only an example and not the actual case. Please don't give solutions pertaining to *b+=1 per se." and your solution applied only to the given case. For more complicated solution, using a pointer to function would work, but might not really improve performance due to the overhead of the function call (compared to the simple branch). – Matthieu M. Jan 10 '12 at 16:00
1

@MatthieuM.: I think it's just really unclear what the OP wants. Surely you can replace the initialization of `incr` by any more complicated composition of conditions. If the entire loop body needs to be flexible, then we need something else (and indeed it's unclear that the indirection through a function pointer is better than the conditional), but that's not clear from the question. -1 for the OP for that. – Kerrek SB Jan 10 '12 at 16:04

Mark B · Answer 5 · 2012-01-10T16:54:31.977

EDIT: Given the new text in the OP "The issue is dealing with statements that can not be pre-calculated outside of the loop."

If the statement can't be pre-calculated outside the loop then by definition it must be computed inside the loop, and anything other than your original code with the check inside the loop will simply obfuscate the code. I assume that since you're asking this question there's still another detail you aren't showing us.

Original answer:

First, just write it the obvious way until profiling shows you that it's too slow. But in this case, you can compute the term up front:

void HeavyLoop(byte* startingAddress, bool secondaryModification)
{
    const int val = 1 + (secondaryModification ? 2 : 0);

    for (int i = 0; i < 10000000; i++)
    {
        byte* b = startingAddress + i;
        *b += val;
    }
}

I emphasized the part in my question that explains that `*b+=1` and `*b+=2` are examples and not relevant, sorry for any confusion. — Rotem, Jan 10 '12 at 16:06

Optimizing a Loop vs Code Duplication

5 Answers5

Linked