Preface:
Recently I have found real cases where Template meta-programming can be useful(can be used in my projects for efficiency improvement).
For example: 1. packing words of the given size of bits into an array of bytes, 2. bytes to bitset
conversion).
All these use cases has something in common: they unroll for
loops in compile time, and this is where performance improvement might come from.
Example: pack array of uint8_t where each element requires 6 bits and uses 8 bits:
inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut, size_t nOfBytes) {
const int T = 3;
const int N = nOfBytes / 8 * 6;
for(int i = 0; i < N; ++i) {
// case 0
if(i % T == 0) pOut[i] = (pIn[i/3*4] << 2) + (pIn[i/3*4+1] >> 4);
// case 1
else if(i % T == 1) pOut[i] = (pIn[(i-1)/3*4+1] << 4) + (pIn[(i-1)/3*4+2] >> 2);
// case 2
else if(i % T == 2) pOut[i] = (pIn[(i-2)/3*4+2] << 6) + pIn[(i-2)/3*4+3];
}
}
we can find out, that as we know number of resulting bits in a compile time, we can replace for loops with manual source code in the manner like this:
pOut[0] = pIn[0] << 2 + pIn[1] >> 4;
pOut[1] = pIn[1] << 4 + pIn[2] >> 2;
pOut[2] = pIn[2] << 6 + pIn[1];
pOut[3] = pIn[4] << 2 + pIn[5] >> 4;
///....
pOut[23] = pIn[30] << 6 + pIn[31];
As it is boring to type manually we can use Template Meta programming for that purpose:
template <size_t nOfBytesIn>
void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
static const size_t nOfBytesOut = nOfBytesIn / 8 * 6;
LoopPack<nOfBytesOut-1>::pack_8bits_to6bits(pIn, pOut);
}
/* #### Switch statement */
template <size_t remainder, int I> struct SwitchStatementPack {
static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) { }
};
template <int I> struct SwitchStatementPack<0, I> {
static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
pOut[I] = (pIn[I/3*4] << 2) + (pIn[I/3*4+1] >> 4);
}
};
template <int I> struct SwitchStatementPack<1, I> {
static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
pOut[I] = (pIn[(I-1)/3*4+1] << 4) + (pIn[(I-1)/3*4+2] >> 2);
}
};
template <int I> struct SwitchStatementPack<2, I> {
static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
pOut[I] = (pIn[(I-2)/3*4+2] << 6) + pIn[(I-2)/3*4+3];
}
};
/*#### Loop*/
template <int I> struct LoopPack {
static const size_t T = 3; // period
static inline void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) {
// recursive step
SwitchStatementPack<I % T, I>::pack_8bits_to6bits(pIn, pOut);
LoopPack<I-1>::pack_8bits_to6bits(pIn, pOut);
}
};
// base case
template <> struct LoopPack<-1> {
static void pack_8bits_to6bits(uint8_t* pIn, uint8_t* pOut) { }
};
Question:
If we can specify using TMP for the compiler to unroll that loop, are current compilers smart enough to do optimization like this on their own?
Update: I updated the question to make it more clear, to show that I am interested in whether compiler can provide such optimization not in whether we can unroll for loop in a compile-time, from the question it should be clear, as I provide TMP solution for unrolling the loop - I know how to unroll it.