Essentially I want to repeat a line of code while changing the value of a single variable, just like a basic for loop.

Right now I'm looking at this:

#define UNROLL2(body) n = 0; body n++; body 
int n;
UNROLL2(std::cout << "hello " << n << "\n";)

It works all right, but I have one issue with it.

It relies on the compiler to optimize out the iteration of n and hopefully turn the variable indices into constants.

Is there a better way to construct such a macro? One that wouldn't rely on compiler optimizations?

At first I thought I could just use a defined value as n and redefine it as the macro churns on, but you can't do that.

Also, yes, I'm aware that most answers on similar topics despise macros, and that it is theoretically possible to unroll loops with templates. Using MSVC, though, I found the results to be inconsistent when the code body requires captures. I could make it work without captures, but that would make everything look far more confusing than just using macros.

user81993
  • Why don't you use a `for` loop? It is only 1 more line, and it is going to be less confusing than a macro. – Rakete1111 May 28 '17 at 10:11
  • @Rakete1111 I want to use it for a performance critical part of the program, for loops don't always get unrolled even with bounds known during compile time and there is no way to force MSVC to do it – user81993 May 28 '17 at 10:19
  • Have you profiled and proven that a for loop or template unrolling is a cause for a significant slowdown? Also I might be biased, but I wouldn't use MSVC on a "performance critical" program. As for the macro loop, [this](https://stackoverflow.com/questions/28231743/self-unrolling-macro-loop-in-c-c) came up. – DeiDei May 28 '17 at 10:28
  • You're doing premature optimisation. For loops aren't always unrolled but a good-quality compiler can usually optimise code better than most programmers can (since specialised knowledge about the target platform is encoded in the compiler implementation). The more viable strategy than your desired macro is (1) write the code in the simplest way that produces the required results (in your case, a loop) (2) test and profile to identify code hot-spots that affect performance and (3) only then lovingly hand-optimise the code to address the IDENTIFIED performance concern. – Peter May 28 '17 at 10:29
  • @peter already at step 3, it's just kinda exhausting writing unrolls by hand beyond the count of 8 – user81993 May 28 '17 at 10:36
  • If the `for` loop isn't getting unrolled, it's because unrolling it isn't a performance-win. It very likely would not be in this case, since the overhead of console I/O is *vastly* greater than the overhead of a simple conditional branch. Don't second-guess your optimizer until you have a good reason to do so. – Cody Gray - on strike May 28 '17 at 10:42
  • @CodyGray not the case, unrolling by hand to 8 has yielded ~30% performance gain while letting the compiler decide makes almost no difference from the baseline. Also, the question isn't about how to best optimize my code but rather about whether or not I can do something with macros. – user81993 May 28 '17 at 10:47
  • You saw a 30% performance gain by unrolling something that calls `std::cout` 8 times? I call BS. If you're talking about unrolling a different kind of loop, then maybe so; MSVC doesn't tend to be very aggressive with loop unrolling when you aren't doing a PGO build. Also, what I posted was a comment, not an answer. – Cody Gray - on strike May 28 '17 at 10:51
  • I think essentially you assume something is slow and can be made faster by using some custom code. So you came up with a macro solution which you assume can tackle the assumed performance issue and came here to ask how to do that specifically. I'm not saying your assumptions are wrong, yet I think you can get a much better answer by not making assumptions for now and instead showing some actual code which is proven to be a bottleneck, then asking 'how to optimize this?'. – stijn May 28 '17 at 10:58
  • @CodyGray Why on earth would anyone unroll std::cout calls for performance? Please consider what you're arguing over and what you postulated, pettiness is not a good trait. – user81993 May 28 '17 at 11:01
  • I…uh…wait, what? I'm talking about the code that appears in your question. I have no idea why you are insulting me. I didn't make this up. – Cody Gray - on strike May 28 '17 at 11:03
  • @stijn That would be essentially asking somebody else to write (not a trivial amount) of code for you, while optimizing the code is my ultimate goal, there is no reason I can't try out various methods myself and instead seek help in implementing those specific methods. – user81993 May 28 '17 at 11:05

2 Answers

In your question, you presented an example that performed console I/O, and I made the comment that console I/O has an overhead substantially larger than that of a loop construct (conditional branch), so it makes very little sense for this type of thing to be unrolled. This is a case where a smart optimizer would probably not unroll, because the increase in code size would not pay dividends in speed. However, based on your follow-up comments, it appears that this was just a little throw-away example, and that I shouldn't have focused so much on the specifics.

In fact, I completely understand what you are saying about MSVC not unrolling loops. Even with optimizations enabled, it tends not to do loop-unrolling unless you are using profile-guided optimizations. Something as trivial as:

void Leaf();

void MyFunction()
{
    for (int i = 0; i < 2; ++i)  { Leaf(); }
}

gets transformed into:

    push  rbx
    sub   rsp, 32
    mov   ebx, 2      ; initialize loop counter to 2
    npad  5
Loop:
    call  Leaf
    sub   rbx, 1      ; decrement loop counter
    jne   SHORT Loop  ; loop again if loop counter != 0
    add   rsp, 32
    pop   rbx
    ret

even at /O2, which is just pathetic.

I discovered this a while ago, and looked to see if it had already been reported as a defect. Unfortunately, Microsoft recently performed a massive purge of all their old bugs from Connect, so you can't go back very far in the archives, but I did find this similar bug. That one got closed as being related to intrinsics, which was either a misunderstanding or a cop-out, so I opened a new, simplified one, based on the code shown above. I'm still awaiting a meaningful response. Seems like pretty low-hanging fruit to me, as far as optimizations go, and all competing compilers will do this, so this is extremely embarrassing for Microsoft's compiler.

So yeah, if you can't switch compilers, and PGO isn't helping you (or you can't use it either), I totally understand why you might want to do some type of manual unrolling. But I don't really understand why you are template-averse. The reason to use templates isn't about despising macros, but rather because they provide a much cleaner, more powerful syntax, while equally guaranteeing that they will be evaluated/expanded at compile time.

You can have something like:

template <int N>
struct Unroller
{
   template <typename T>
   void operator()(T& t)
   {
      t();
      Unroller<N-1>()(t);
   }
};

template <>
struct Unroller<0>
{
   template <typename T>
   void operator()(T&)
   { }
};

and combine it with a functor that can be as simple or as complex as you need it to be:

struct MyOperation
{
   inline void operator()() { Leaf(); }
};

so that, with the magic of recursive template expansion, you can do:

void MyFunction()
{
   MyOperation op;
   Unroller<16>()(op);
}

and get precisely the output you expect:

sub  rsp, 40
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
call Leaf
add  rsp, 40
jmp  Leaf

Naturally, this is a simple example, but it shows you that the optimizer is even able to do a tail-call optimization here: the sixteenth call is the final `jmp Leaf`. Because the template magic works with a functor, as I said above, you can make the logic to be unrolled as complicated as it needs to be, adding member variables, etc. It'll all get unrolled because the templates are expanded recursively at compile time.
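
The question also wants the loop index available inside the body, which the Unroller shown earlier doesn't pass along. Here is a minimal sketch of one way to do that, assuming C++14 is available (for std::index_sequence and generic lambdas). The names unroll, unroll_impl, and sum4 are made up for illustration and are not part of the code above. Whether MSVC actually inlines every call is not guaranteed, so it is worth checking the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical helper: calls f once for each index in Is..., passing the
// index as a std::integral_constant so it is a compile-time constant
// inside the body rather than a runtime counter.
template <typename F, std::size_t... Is>
inline void unroll_impl(F& f, std::index_sequence<Is...>)
{
    // A braced initializer list evaluates its elements left to right,
    // so the calls happen in index order 0, 1, ..., N-1.
    int dummy[] = { 0, (f(std::integral_constant<std::size_t, Is>{}), 0)... };
    (void)dummy;
}

// Hypothetical entry point: unroll<N>(body) expands to N calls of body.
template <std::size_t N, typename F>
inline void unroll(F f)
{
    unroll_impl(f, std::make_index_sequence<N>{});
}

// Example body: sums the first four elements of an array.
inline int sum4(const int* data)
{
    assert(data != nullptr);
    int total = 0;
    // i.value is a constant expression (0, 1, 2, 3 in turn).
    unroll<4>([&](auto i) { total += data[i.value]; });
    return total;
}
```

Because the index arrives as a type, the body can also use it where a constant expression is required, for example as a template argument or an array bound.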

Literally the only disadvantage that I can find to this is that it bloats the object file a bit with all of the template expansions. In this case, with Unroller<16>, I get 17 different function definitions emitted in the object file. But, aside from a minor impact on compile times, that's no big deal, because they won't be included in the final binary output. It would obviously be better if the compiler's optimizer was smart enough to do this on its own, but until that time, this is a viable solution for holding its hand and forcing it to generate the code you want, and I think it's much cleaner than the macro-based approach.

Cody Gray - on strike

This can be done using the Boost Preprocessor library (with the BOOST_PP_REPEAT macro), but please bear in mind that the fact that you can does not mean that you should.

#include <iostream>
#include <boost/preprocessor/repetition/repeat.hpp>

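// BOOST_PP_REPEAT invokes DECL(z, n, data) for n = 0 .. 4. Each n is
// substituted as a literal token, so no runtime counter variable is needed.
#define DECL(z, n, text) std::cout << "n = " << n << std::endl;

int main()
{
  BOOST_PP_REPEAT(5, DECL, "");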
}
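
If pulling in Boost isn't an option, the same idea can be sketched with a few hand-written macros. This is only an illustration: REPEAT, REPEAT_1 through REPEAT_4, PRINT_HELLO, and hello4 are hypothetical names. The count must be written as a literal (token pasting does not expand it first), and you need one REPEAT_k line for every count you want to support.

```cpp
#include <sstream>
#include <string>

// Each REPEAT_k expands the previous level and then appends one more
// invocation of M, with the index written out as a literal.
#define REPEAT_1(M) M(0)
#define REPEAT_2(M) REPEAT_1(M) M(1)
#define REPEAT_3(M) REPEAT_2(M) M(2)
#define REPEAT_4(M) REPEAT_3(M) M(3)

// Selects the right level by pasting the literal count onto REPEAT_.
#define REPEAT(N, M) REPEAT_##N(M)

// The body to repeat; n is replaced by the literal 0, 1, 2, 3.
#define PRINT_HELLO(n) out << "hello " << n << "\n";

// Writes the four unrolled lines into a string.
inline std::string hello4()
{
    std::ostringstream out;
    REPEAT(4, PRINT_HELLO)
    return out.str();
}
```

Here the index never exists as a variable at all, which is what the question was asking for, at the cost of maintaining the REPEAT_k ladder by hand.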
KCH