I have a loop with a switch inside that looks something like this (but much more complex).
for(int i = 0; i < N; i += inc)
{
v = 0.0;
switch(ConstVal)
{
case 2: v += A[i];
case 1: v += A[i+k];
case 0: v += A[i+j];
}
// do some stuff with v
}
ConstVal is unknown at compile time but fixed during the initialization routine. Is there any way to remove the switch statement without compiling multiple variants of the for loop? Given that x86 has indirect branching, there should be a simple way to inline assembly to jump to the case I want, rather than back to the top of the loop each iteration. How would you do this (in gcc)? Finally, can this be done without interfering with the compiler's optimization analysis. I'm already manually unrolling loops, but I'm sure there are lots more optimizations going on which I don't want to break.
It's my understanding that the Julia meta-programming feature gives you access to the parser and abstract syntax tree. In combination with JIT, you can resolve this kind of problem. I would think there would be some reasonable workaround in C even without semantics for indirect branch. Note that Duff's device is not a solution since I want to return to the same case statement on each loop iteration. This issue comes up frequently.
EDIT
I discovered there is no conditional indirect branch x86 instruction. Further, gcc inline assembly only allows for fixed labels. And yet, using gcc extensions, this can still be done. See, for example, https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html#Labels-as-Values.
This is how. In my code, it was difficult to determine if there was any performance difference, but on another machine or with a much smaller and simpler loop, it may make a difference.
void *forswitch;
switch(ConstVal)
{
case 2: forswitch = &&C; break;
case 1: forswitch = &&B; break;
case 0: forswitch = &&A; break;
}
void *_forswitch[] = {forswitch, &&END_FOR_SWITCH};
i = 0;
{
v = 0.0;
C: v += _A[i];
B: v += _A[i+k];
A: v += _A[i+j];
// do some stuff with v
i += inc;
forswitch = _forswitch[i==N];
//forswitch = (i<N)? forswitch: &&END_FOR_SWITCH;
goto *forswitch;
}
END_FOR_SWITCH:
return;
I've replaced the for loop with my own implementation based on the gcc extension that gives access to machine level indirect branching. There are a couple of ways to do this. The first is to index an array which jumps to the conditional start of the loop or the end of the loop depending on the loop index. The other way (commented) is to conditionally set the branch register each time. The compiler should replace any branch with a conditional move (CMOV).
There are a number of obvious issues with this solution. (1) It's not portable. (2) By implementing the loop myself, it's not only harder to understand the code but may interfere with compiler optimizations (such as automatic loop unrolling). (3) The compiler can't jointly optimize the entire switch statement even though there are no breaks because it has no knowledge at compile time of which statements will actually be executed. However, it may be able to cleverly re-organize the switch in a manner similar to how others have pointed out in some of the responses below. By manually implementing the switch myself (in combination with the for loop), I'm making it much more difficult for the compiler to make any such optimization since by removing the switch semantics, my intent is obscured by the optimizations.
Nevertheless, if it made a significant performance improvement, I still think this would be better than having multiple copies of the code. With macros, the non-portable extensions could probably be compiled conditionally; this could basically be made to look like a normal loop.
EDIT 2
I've found a much better solution which is more portable and more effective. When you have a situation where there are a small number of possible run-time determined options, you can make a wrapper around the optimized function, fix all the run-time constants, and then inline the function for each copy of constants. If there is only a single constant, you can use a lookup table of function pointers, each of which sets the constant and inlines the function. If you have a more complex situation, you'll need some if-elseif-else control structure. One of the functions can be left with all free variables so there is no loss of generality. I'm thinking of this as being a sort of compile-time closure. The compiler is doing all the hard work without any messy macros or otherwise duplicate code to maintain.
In my code, this resulted is a 10% to 20% performance increase on already significantly optimized code (due to hard-coding of various constants and not really anything to do with the switch itself). In my toy example, the change would look something like this.
inline void __foo(const int ConstVal)
{
for(int i = 0; i < N; i += inc)
{
v = 0.0;
switch(ConstVal)
{
case 2: v += A[i];
case 1: v += A[i+k];
case 0: v += A[i+j];
}
// do some stuff with v
}
}
void foo()
{
// this could be done with a lookup table
switch(ConstVal)
{
case2: __foo(2); break;
case1: __foo(1); break;
case0: __foo(0); break;
}
}
By inlining __foo, the compiler will eliminate the switch as well as any other constants that you pass along. You will, of course, end up with more compiled code, but for a small optimized routine, this shouldn't be a big deal.