0

Preface:
Recently I have found real cases where Template meta-programming can be useful(can be used in my projects for efficiency improvement).
For example: 1. packing words of the given size of bits into an array of bytes, 2. bytes to bitset conversion).
All these use cases has something in common: they unroll for loops in compile time, and this is where performance improvement might come from.

Example: pack array of uint8_t where each element requires 6 bits and uses 8 bits:

inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut, size_t nOfBytes) {
  const int T = 3;
  const int N = nOfBytes / 8 * 6;
  for(int i = 0; i < N; ++i) {
    // case 0
    if(i % T == 0) pOut[i] = (pIn[i/3*4] << 2) + (pIn[i/3*4+1] >> 4);
    // case 1
    else if(i % T == 1) pOut[i] = (pIn[(i-1)/3*4+1] << 4) + (pIn[(i-1)/3*4+2] >> 2);
    // case 2
    else if(i % T == 2) pOut[i] = (pIn[(i-2)/3*4+2] << 6) + pIn[(i-2)/3*4+3];
    }
  }

we can find out, that as we know number of resulting bits in a compile time, we can replace for loops with manual source code in the manner like this:

pOut[0] = pIn[0] << 2 + pIn[1] >> 4;
pOut[1] = pIn[1] << 4 + pIn[2] >> 2;
pOut[2] = pIn[2] << 6 + pIn[1];
pOut[3] = pIn[4] << 2 + pIn[5] >> 4;
///....
pOut[23] = pIn[30] << 6 + pIn[31];

As it is boring to type manually we can use Template Meta programming for that purpose:

template <size_t nOfBytesIn>
void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
  static const size_t nOfBytesOut = nOfBytesIn / 8 * 6;
  LoopPack<nOfBytesOut-1>::pack_8bits_to6bits(pIn, pOut);
}
/* #### Switch statement */
template <size_t remainder, int I> struct SwitchStatementPack {
  static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) { }
};

template <int I> struct SwitchStatementPack<0, I> {
  static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
    pOut[I] = (pIn[I/3*4] << 2) + (pIn[I/3*4+1] >> 4);
  }
};
template <int I> struct SwitchStatementPack<1, I> {
  static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
    pOut[I] = (pIn[(I-1)/3*4+1] << 4) + (pIn[(I-1)/3*4+2] >> 2);
  }
};
template <int I> struct SwitchStatementPack<2, I> {
  static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
    pOut[I] = (pIn[(I-2)/3*4+2] << 6) + pIn[(I-2)/3*4+3];
  }
};
/*#### Loop*/
template <int I> struct LoopPack {
  static const size_t T = 3; // period
  static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
    // recursive step
    SwitchStatementPack<I % T, I>::pack_8bits_to6bits(pIn, pOut);
    LoopPack<I-1>::pack_8bits_to6bits(pIn, pOut);
  }
};
// base case
template <> struct LoopPack<-1> {
  static void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) { }
};

Question:
If we can specify using TMP for the compiler to unroll that loop, are current compilers smart enough to do optimization like this on their own?

Update: I updated the question to make it more clear, to show that I am interested in whether compiler can provide such optimization not in whether we can unroll for loop in a compile-time, from the question it should be clear, as I provide TMP solution for unrolling the loop - I know how to unroll it.

Deduplicator
  • 44,692
  • 7
  • 66
  • 118
spin_eight
  • 3,925
  • 10
  • 39
  • 61
  • 4
    Yes, if the loop parameters can be determined at compile time (which is likely the case if the function call is inlined and called with constants). Check the disassembly of an optimized build to see. – Cameron May 24 '16 at 16:11
  • Boost.Hana provides a utility for unrolling a loop by way of `5_c.times(function)`, where `5_c` is an instance of its own `integral_constant` type. – chris May 24 '16 at 16:13
  • @Cameron I intentionally provided the function signature without constants, only nOfBytes is const. If all arguments are constant this is another case, and that about if this isn't the case? – spin_eight May 24 '16 at 16:14
  • @spin_eight _"are current compilers smart enough to do optimization like this on their own?"_ The answer is already given in the dupes accepted answer. – πάντα ῥεῖ May 24 '16 at 16:19
  • The `N` variable is `const`, but its value is not constant -- for all we know the function could be called with user input, in which case the compiler can't (completely) unroll the loop at compile time. But if the compiler inlines the function, then on a case-by-case basis it can determine which loops have a constant number of iterations and which cannot be determined in advance. Then it can decide how much, if any, unrolling to do. All of this is dependent on the specifics of your calling code, compiler, and optimization level. But to answer your question, it's *possible*, sometimes likely. – Cameron May 24 '16 at 21:10
  • How about adding N as a template parameter to pack_8bits_to6bits and then add a wrapper function with a switch statement. Then you could use gccs -funroll-loops parameter which normally works pretty well if N is a compile time constant (which it would be with this construct). Something along the lines of http://pastebin.com/Vb4qQLKB – Troels May 25 '16 at 21:52

0 Answers0