4

My question is similar to Can one unroll a loop when working with an integer template parameter? but I want to mix compile time and runtime. Specifically, I know at compile time a constant NBLOCK and I want to write a switch on a variable start_block which is only known at runtime where NBLOCK is the number of entries in the switch. Here is what I got using macros:

#define CASE_UNROLL(i_loop)         \
  case i_loop : \
    dst.blocks[i_loop+1] -= (load_unaligned_epi8(srcblock) != zero) & block1; \
    srcblock += sizeof(*srcblock);

  switch(start_block)
    {
      CASE_UNROLL(0);
#if NBLOCKS > 2
      CASE_UNROLL(1);
#endif
#if NBLOCKS > 3
      CASE_UNROLL(2);
#endif
#if NBLOCKS > 4
      CASE_UNROLL(3);
#endif
...
...
#if NBLOCKS > 15
      CASE_UNROLL(14);
#endif
#if NBLOCKS > 16
#error "Too many blocks"
#endif
    }

I find it very ugly. Especially if I want to raise the bound from 16 to 32.

I would like to know if it is possible to write that using some template meta programming. The hard part is that for performance reasons it is crucial that the switch is compiled with a jump table than a sequence of nested conditional.

Note that the question is very similar to C++/C++11 - Switch statement for variadic templates? but as far as I understand the solution proposed here is to remove the switch by using a mixed compile/tun time initialized functions pointer array. I can't pay the prince a calling a function here.

I'm working with GCC if some nasty extensions is needed.

Community
  • 1
  • 1
hivert
  • 10,579
  • 3
  • 31
  • 56
  • 1
    possible duplicate of [How to unroll a short loop in C++ using templates?](http://stackoverflow.com/questions/2382137/how-to-unroll-a-short-loop-in-c-using-templates) – Paul R Jun 07 '13 at 10:56
  • No it's not a duplicate. In my case, the starting point is known at runtime and the ending point at compile time. GCC is able to unroll the code but it use a lot of jump. This version is much faster because there is only one jump. However it uses macro. I'd rather write a template. – hivert Jun 07 '13 at 23:29
  • my I ask why? Is it really faster than the compiled code? – Tobias Langner Jun 13 '13 at 11:34
  • Yes 30% faster. The switch here is in my very inner computation loop. – hivert Jun 13 '13 at 14:27

2 Answers2

2

You could simply use Boost.Preprocessor with BOOST_PP_REPEAT(COUNT, MACRO, DATA):

#define APPLY_FUNC(INDEX, FUNC) FUNC(INDEX);

// ...

switch(start_block)
{
    BOOST_PP_REPEAT(NBLOCK, APPLY_FUNC, CASE_UNROLL);
}

That should be expanded to:

switch(start_block)
{
    CASE_UNROLL(0);
    CASE_UNROLL(1);
    CASE_UNROLL(2);
    // ...
    CASE_UNROLL(NBLOCK-1);
}
Morwenn
  • 21,684
  • 12
  • 93
  • 152
  • Thanks for the tip. It still doesn't answer the question if it's doable with template. I'm thinking that it's not but I'd like to have the answer of an expert. – hivert Jul 15 '13 at 22:50
1

Template based unrolling:

template<int N>
struct loopUnroller
{
  template<typename Operation>
  inline void operator(Operation& op) { op(); loopUnroller<N-1>(op); }
};

template<>
struct loopUnroller<0>
{
  template<typename Operation>
  inline void operator(Operation& op) { op(); }
};

A call to loopUnroller<6>(Foo) will likely be inlined, but also contain a call to an inlined loopUnroller<5>(Foo) etc. Each level adds an extra call to Foo().

If your compiler refuses to inline 16 levels deep, there's a simple fix:

template<>
struct loopUnroller<16>
{
  template<typename Operation>
  inline void operator(Operation& op) { 
        op(); op(); op(); op();
        op(); op(); op(); op();
        op(); op(); op(); op();
        op(); op(); op(); op();
  }
};

With logarithmic complexity:

template<int N>
struct loopUnroller
{
  template<typename Operation>
  inline void operator(Operation& op) { 
       loopUnroller<N/2>(op);
       loopUnroller<N/2>(op);
       if (N%1) { op(); } // Will be optimized out if N is even.
  }
};

With dynamic complexity:

template<int L>
struct loopUnroller
{
  template<typename Operation>
  inline void operator(Operation& op, int i) {
     if (i & (1<<L)) {
       for(int j = 0; j != 1<<L; ++j)
       {
         op();
       }
     }
     loopUnroller<L-1>(op, i);
  }
};

The for loop now has a fixed runtime length, making it likely to be unrolled. So you have an unrolled loop of length 32,16,8,4,2 and 1 (assuming no specializations) and at runtime you choose the loops based on the bits of i.

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • To point is that I want to make sure that there is a switch since I only know at runtime to starting point of the loop. – hivert Aug 23 '13 at 09:50
  • @hivert: That's an XY problem again. You're assuming a specific solution. The last snippet of code has `int i` as a runtime starting point too, just no switch. – MSalters Aug 23 '13 at 10:05