In your question, you presented an example that performed console I/O, and I commented that console I/O has overhead substantially larger than that of a loop construct (a conditional branch), so unrolling that kind of loop makes very little sense. It is a case where a smart optimizer would probably not unroll, because the increase in code size would not pay dividends in speed. However, based on your follow-up comments, it appears that this was just a throw-away example, and I shouldn't have focused so much on the specifics.
In fact, I completely understand what you are saying about MSVC not unrolling loops. Even with optimizations enabled, it tends not to do loop-unrolling unless you are using profile-guided optimizations. Something as trivial as:
void Leaf();
void MyFunction()
{
    for (int i = 0; i < 2; ++i) { Leaf(); }
}
gets transformed into:
        push    rbx
        sub     rsp, 32
        mov     ebx, 2             ; initialize loop counter to 2
        npad    5
Loop:
        call    Leaf
        sub     rbx, 1             ; decrement loop counter
        jne     SHORT Loop         ; loop again if loop counter != 0
        add     rsp, 32
        pop     rbx
        ret
even at /O2, which is just pathetic.
I discovered this a while ago and looked to see if it had already been reported as a defect. Unfortunately, Microsoft recently purged all of their old bugs from Connect, so you can't go back very far in the archives, but I did find this similar bug. That one got closed as being related to intrinsics, which was either a misunderstanding or a cop-out, so I opened a new, simplified one based on the code shown above. I'm still awaiting a meaningful response. This seems like pretty low-hanging fruit as far as optimizations go, and every competing compiler does it, so it's extremely embarrassing for Microsoft's compiler.
So yeah, if you can't switch compilers and PGO isn't helping you (or you can't use it either), I totally understand why you might want to do some kind of manual unrolling. But I don't really understand why you are template-averse. The reason to use templates isn't about despising macros; it's that they provide a much cleaner, more powerful syntax while giving the same guarantee of being evaluated/expanded at compile time.
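Just to be clear about what I'm comparing against, a macro-based unroll would look something like this (a rough sketch of my own; REPEAT_4 and MyFunctionMacro are made-up names, not from your code):

#define REPEAT_4(stmt) do { stmt; stmt; stmt; stmt; } while (0)

void Leaf();   // same leaf function as above

void MyFunctionMacro()
{
    REPEAT_4(Leaf());   // the preprocessor pastes four consecutive calls to Leaf()
}

It works, but the repeat count is baked into the macro name (or hidden behind a ladder of helper macros), which is part of why I find the template version below cleaner.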
You can have something like:
template <int N>
struct Unroller
{
    template <typename T>
    void operator()(T& t)
    {
        t();                  // perform one iteration
        Unroller<N-1>()(t);   // recurse to expand the remaining N-1 iterations
    }
};

template <>
struct Unroller<0>            // base case: terminates the recursion
{
    template <typename T>
    void operator()(T&)
    { }
};
and combine it with a functor that can be as simple or as complex as you need it to be:
struct MyOperation
{
    inline void operator()() { Leaf(); }
};
so that, with the magic of recursive template expansion, you can do:
void MyFunction()
{
    MyOperation op;
    Unroller<16>()(op);
}
and get precisely the output you expect:
        sub     rsp, 40
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        call    Leaf
        add     rsp, 40
        jmp     Leaf
Naturally, this is a simple example, but it shows that the optimizer is even able to do a tail-call optimization here, turning the sixteenth call into jmp Leaf. Because the template magic works with a functor, as I said above, you can make the logic being unrolled as complicated as it needs to be, adding member variables and so on. It all gets unrolled because the templates are expanded recursively at compile time.
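For instance, here is a sketch of what I mean by a stateful functor (the names Accumulator and SumOfSixteen are mine, purely for illustration); it reuses the Unroller template exactly as defined above:

struct Accumulator
{
    const int* data;   // array being summed
    int        index;  // next element to read
    int        total;  // running sum

    inline void operator()()
    {
        total += data[index++];   // one unrolled iteration
    }
};

int SumOfSixteen(const int* data)
{
    Accumulator acc = { data, 0, 0 };
    Unroller<16>()(acc);          // expands to 16 consecutive additions
    return acc.total;
}

The state lives in ordinary member variables, and the recursion flattens into sixteen consecutive additions, so the optimizer should have no trouble keeping index and total in registers.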
Literally the only disadvantage I can find to this is that it bloats the object file a bit with all of the template expansions. In this case, with Unroller<16>, I get 17 different function definitions emitted in the object file. But, aside from a minor impact on compile times, that's no big deal, because they won't be included in the final binary output. It would obviously be better if the compiler's optimizer were smart enough to do this on its own, but until then, this is a viable way of holding its hand and forcing it to generate the code you want, and I think it's much cleaner than the macro-based approach.