0

I have a loop with a switch inside that looks something like this (but much more complex).

for(int i = 0; i < N; i += inc)
{
    v = 0.0;
    switch(ConstVal)
    {
    case 2: v += A[i];
    case 1: v += A[i+k];
    case 0: v += A[i+j];
    }
    // do some stuff with v
}

ConstVal is unknown at compile time but fixed during the initialization routine. Is there any way to remove the switch statement without compiling multiple variants of the for loop? Given that x86 has indirect branching, there should be a simple way to inline assembly to jump to the case I want, rather than back to the top of the loop each iteration. How would you do this (in gcc)? Finally, can this be done without interfering with the compiler's optimization analysis. I'm already manually unrolling loops, but I'm sure there are lots more optimizations going on which I don't want to break.

It's my understanding that the Julia meta-programming feature gives you access to the parser and abstract syntax tree. In combination with JIT, you can resolve this kind of problem. I would think there would be some reasonable workaround in C even without semantics for indirect branch. Note that Duff's device is not a solution since I want to return to the same case statement on each loop iteration. This issue comes up frequently.

EDIT

I discovered there is no conditional indirect branch x86 instruction. Further, gcc inline assembly only allows for fixed labels. And yet, using gcc extensions, this can still be done. See, for example, https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html#Labels-as-Values.

This is how. In my code, it was difficult to determine if there was any performance difference, but on another machine or with a much smaller and simpler loop, it may make a difference.

void *forswitch;
switch(ConstVal)
{
case 2: forswitch = &&C; break;
case 1: forswitch = &&B; break;
case 0: forswitch = &&A; break;
}
void *_forswitch[] = {forswitch, &&END_FOR_SWITCH};


i = 0;
{
    v = 0.0;
    C: v += _A[i];
    B: v += _A[i+k];
    A: v += _A[i+j];

    // do some stuff with v

    i += inc;
    forswitch = _forswitch[i==N];
    //forswitch = (i<N)? forswitch: &&END_FOR_SWITCH;
    goto *forswitch;
}

END_FOR_SWITCH:
return;

I've replaced the for loop with my own implementation based on the gcc extension that gives access to machine level indirect branching. There are a couple of ways to do this. The first is to index an array which jumps to the conditional start of the loop or the end of the loop depending on the loop index. The other way (commented) is to conditionally set the branch register each time. The compiler should replace any branch with a conditional move (CMOV).

There are a number of obvious issues with this solution. (1) It's not portable. (2) By implementing the loop myself, it's not only harder to understand the code but may interfere with compiler optimizations (such as automatic loop unrolling). (3) The compiler can't jointly optimize the entire switch statement even though there are no breaks because it has no knowledge at compile time of which statements will actually be executed. However, it may be able to cleverly re-organize the switch in a manner similar to how others have pointed out in some of the responses below. By manually implementing the switch myself (in combination with the for loop), I'm making it much more difficult for the compiler to make any such optimization since by removing the switch semantics, my intent is obscured by the optimizations.

Nevertheless, if it made a significant performance improvement, I still think this would be better than having multiple copies of the code. With macros, the non-portable extensions could probably be compiled conditionally; this could basically be made to look like a normal loop.

EDIT 2

I've found a much better solution which is more portable and more effective. When you have a situation where there are a small number of possible run-time determined options, you can make a wrapper around the optimized function, fix all the run-time constants, and then inline the function for each copy of constants. If there is only a single constant, you can use a lookup table of function pointers, each of which sets the constant and inlines the function. If you have a more complex situation, you'll need some if-elseif-else control structure. One of the functions can be left with all free variables so there is no loss of generality. I'm thinking of this as being a sort of compile-time closure. The compiler is doing all the hard work without any messy macros or otherwise duplicate code to maintain.

In my code, this resulted is a 10% to 20% performance increase on already significantly optimized code (due to hard-coding of various constants and not really anything to do with the switch itself). In my toy example, the change would look something like this.

inline void __foo(const int ConstVal)
{
    for(int i = 0; i < N; i += inc)
    {
        v = 0.0;
        switch(ConstVal)
        {
        case 2: v += A[i];
        case 1: v += A[i+k];
        case 0: v += A[i+j];
        }
        // do some stuff with v
    }
}

void foo()
{
    // this could be done with a lookup table
    switch(ConstVal)
    {
    case2: __foo(2);  break;
    case1: __foo(1);  break;
    case0: __foo(0);  break;
    }
}

By inlining __foo, the compiler will eliminate the switch as well as any other constants that you pass along. You will, of course, end up with more compiled code, but for a small optimized routine, this shouldn't be a big deal.

Todd
  • 475
  • 4
  • 15
  • You could just invert the nesting of the `switch` and `for` (though you'd need to write a `for` for each `case`). – EOF Nov 04 '16 at 21:39
  • Do you _intentionally_ have no `break` statement after each case??? – Paul Ogilvie Nov 04 '16 at 21:41
  • Yes, no break statement intentionally. This is only to demonstrate the problem. The actual code is far more complex which is why I don't want multiple copies. – Todd Nov 04 '16 at 22:02

3 Answers3

2

No, I don't see any way to optimize the switch statement away. Besides, it is not that expensive. Because there are no break statements, the switch has a "fall thru" tendency. It translates to:

    switch(ConstVal)
    {
    case 2: v= A[i] + A[i+k] + A[i+j]; break;
    case 1: v=        A[i+k] + A[i+j]; break;
    case 0: v=                 A[i+j]; break;
    }
    // do some stuff with v

and I don't see how to remove the dependency on ConstVal.

You could make a switch before the loop with 3 loops, one for for each value of ConstVal but that will surely look like ugly code, depending on what do some stuff with v does.

Paul Ogilvie
  • 25,048
  • 4
  • 23
  • 41
  • Another reason to remove the conditional is to port this to a GPU. On a GPU the threads share a program counter and conditional execution is much more problematic. Since the control flow will never change, perhaps it doesn't matter, but I don't have the experience yet to know. – Todd Nov 04 '16 at 22:14
  • Then you might prefer "ugly code" but which is more optimized to use the hardware. – Paul Ogilvie Nov 04 '16 at 22:23
1

When do you know ConstVal, and how often does it change? If you can recompile a small routine and re-link the whole thing whenever ConstVal changes, that will solve your problem.

Having said that, do you know this is a problem? Is that switch responsible for 10% or more of execution time? It is very common for people to focus on non-problems. They know they should "profile first" but they don't actually do it. (It's a case of "ready-fire-aim" :)

A method many people rely on is this.

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • Are you talking about JIT of the main loop once I know ConstVal? I thought about that but seems it would be really hard without builtin language support. The performance is mostly insignificant at the moment, but (1) it's a problem I have frequently when doing signal processing, (2) any conditionals in the loop tend to interfere with compiler optimizations and I'm still working on the code, and (3) see my comment about GPUs. – Todd Nov 06 '16 at 17:18
  • @Todd: I'm asking if you can (maybe you can't) say something like `const int ConstVal = PREPROCESSOR_SYMBOL_PASSED_TO_COMPILER;`. Then just compile it when you know the number. If you -can't-, *and if it's actually a problem*, I would put the `switch` outside, have 3 copies of the `for` loop, and unroll each loop like 8 times (and not depend on the compiler to do it). If anybody sneers at you and says "that code smells" tell them to shove it - what you need is speed. – Mike Dunlavey Nov 06 '16 at 17:46
  • Ah OK. Unfortunately I need to be able to change it without recompiling. It's not totally unreasonable to make multiple copies of the code. It's just that usually when I think code is done and it won't change, turns out I need to change it. Then I have some work cut out for me. – Todd Nov 06 '16 at 18:04
  • @Todd: You can use macros to reduce the number of code lines you have to change. (I've had academics sneer at such use of macros, but they tend to do that. It can make the code a lot more maintainable, without giving up performance.) – Mike Dunlavey Nov 06 '16 at 20:08
0

Is there any way to remove the switch statement without compiling multiple variants of the for loop?

In general it's better either to make different variants or just leave as is. But in this case maybe we can invent some trick. Kind of this

switch (constVal) {
case 2:
    A1 = A;
    B1 = A + k;
    C1 = A + j;
    break;
case 1:
    A1 = big_zero_array;
    B1 = A + k;
    C1 = A + j;
    break;
case 0:
    A1 = B1 = big_zero_array;
    C1 = A + j
    break;
}

for (int i = 0; i < N; i += inc) {
    v = A1[i] + B1[i] + C1[i];
    //....
}

Still this thing requires additional memory and may be even slower under some circumstances.

Matt
  • 13,674
  • 1
  • 18
  • 27