Self-unrolling macro loop in C/C++

Question

I am currently working on a project, where every cycle counts. While profiling my application I discovered that the overhead of some inner loop is quite high, because they consist of just a few machine instruction. Additionally the number of iterations in these loops is known at compile time.

So I thought instead of manually unrolling the loop with copy & paste I could use macros to unroll the loop at compile time so that it can be easily modified later.

What I image is something like this:

#define LOOP_N_TIMES(N, CODE) <insert magic here>

So that I can replace for (int i = 0; i < N, ++i) { do_stuff(); } with:

#define INNER_LOOP_COUNT 4
LOOP_N_TIMES(INNER_LOOP_COUNT, do_stuff();)

And it unrolls itself to:

do_stuff(); do_stuff(); do_stuff(); do_stuff();

Since the C preprocessor is still a mystery to me most of the time, I have no idea how to accomplish this, but I know it must be possible because Boost seems to have a BOOST_PP_REPEAT macros. Unfortunately I can't use Boost for this project.

I am using a modified version of GCC for the architecture I am working on. So I think technically yes. — Karsten, Jan 30 '15 at 08:17
Have you looked at [-funroll-loops](http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html)? — dmg, Jan 30 '15 at 08:28
The compiler does not unroll this loop regardless of what I configure it to do. Side note: I always want to know how this can be done for educational purposes, not just for this specific case. — Karsten, Jan 30 '15 at 08:30
Why you cannot use Boost for this? If it's for technical reason (which seems unlikely) then I doubt that you can do this at all. After all, Boost PP is header only library, if I have understood correctly. If nothing else, you should be able to look from Boost how it could done by yourself. — user694733, Jan 30 '15 at 08:36
@user694733: I can't use Boost because the project must not have any dependencies. I looked at the source code of `BOOST_PP_REPEAT` and it seems to be the about the same as most of the proposed solutions. I hoped there would be a more generic solution, but I suppose this is not possible, because you can't write recursive macros... — Karsten, Jan 30 '15 at 12:17
Possible duplicate of [Writing a while loop in the C preprocessor](https://stackoverflow.com/questions/319328/writing-a-while-loop-in-the-c-preprocessor) — Cristian Ciupitu, Apr 27 '18 at 13:18

score 31 · Accepted Answer · answered Jan 30 '15 at 08:58

31

You can use templates to unroll. See the disassembly for the sample Live on Godbolt

enter image description here

But -funroll-loops has the same effect for this sample.

Live On Coliru

template <unsigned N> struct faux_unroll {
    template <typename F> static void call(F const& f) {
        f();
        faux_unroll<N-1>::call(f);
    }
};

template <> struct faux_unroll<0u> {
    template <typename F> static void call(F const&) {}
};

#include <iostream>
#include <cstdlib>

int main() {
    srand(time(0));

    double r = 0;
    faux_unroll<10>::call([&] { r += 1.0/rand(); });

    std::cout << r;
}

answered Jan 30 '15 at 08:58

sehe

374,641
47
450
633

Showing how to do this without evil MACROs. And showing that `-funroll-loops` is capable of the same (depending on code) – sehe Jan 30 '15 at 09:02
Definitely an interesting approach, but I doubt my compiler supports this. I'll report back when I tried it. – Karsten Jan 30 '15 at 09:16
Just tested it: got a 'expected primary-expression before '[' token' compiler error. So I guess I can't use this although it seems to be the most generic approach. – Karsten Jan 30 '15 at 12:23
@Karsten enable c++11 (or c++0x) if your compiler has it. All compiler since ~2010 should do – sehe Jan 30 '15 at 12:24
Unfortunately I am bound to the compiler I am currently using and it does not support c++11. – Karsten Jan 30 '15 at 12:26
I accept this answer nonetheless, because it seems to be the best solution in the general case. – Karsten Jan 30 '15 at 12:30
@Karsten add `-ftree-vectorize` and you might get vectorization as well! – Mgetz Jan 30 '15 at 12:33
3

Hmm, performance is critical enough to sink to macro hacks, but not critical enough to compile even one source with a newer compiler. – Potatoswatter Jan 30 '15 at 13:40
many processors have a program cache if you keep the scope of the executable loop within the confines of the cache, (even better, within the confines of a single row of the cache) then you are eliminating any instruction memory fetches (and with care may even be able to eliminate any data memory accesses) This will (in general) result in the fastest executing code. then, some loop unrolling (keeping in mind the above goals) will eliminate (or minimize) the code executed to perform the checking for loop exit and jump back to the top of the loop. – user3629249 Feb 01 '15 at 02:22
continuing.. some processors have a large enough pipeline that the complete loop can be contained within the pipeline. Then even the cache is not accessed. This can result in no wait states and no pipeline stalls. The result is the max possible execution rate for the processor – user3629249 Feb 01 '15 at 02:24
3

@user3629249: I know this is true for most processors (especially x86), but the processor I am working with is a special ASIP without branch prediction. Therefore EVERY iteration needs at least 3 extra cycles (increment, branch and 1 pipeline stall because of the branch). If as in my case the loop contains only 1 to 8 instructions this is a considerable slow down. – Karsten Feb 05 '15 at 09:30
3

@Potatoswatter: I can't use a newer compiler because there exists none for my processor. – Karsten Feb 05 '15 at 09:31

score 16 · Answer 2 · answered Jan 30 '15 at 08:40

16

You can use the pre-processor and play some tricks with token concatenation and multiple macro expansion, but you have to hard-code all possibilities:

#define M_REPEAT_1(X) X
#define M_REPEAT_2(X) X X
#define M_REPEAT_3(X) X X X
#define M_REPEAT_4(X) X X X X
#define M_REPEAT_5(X) X M_REPEAT_4(X)
#define M_REPEAT_6(X) M_REPEAT_3(X) M_REPEAT_3(X)

#define M_EXPAND(...) __VA_ARGS__

#define M_REPEAT__(N, X) M_EXPAND(M_REPEAT_ ## N)(X)
#define M_REPEAT_(N, X) M_REPEAT__(N, X)
#define M_REPEAT(N, X) M_REPEAT_(M_EXPAND(N), X)

And then expand it like this:

#define THREE 3

M_REPEAT(THREE, three();)
M_REPEAT(4, four();)
M_REPEAT(5, five();)
M_REPEAT(6, six();)

This method requires literal numbers as counts, you can't do something like this:

#define COUNT (N + 1)

M_REPEAT(COUNT, stuff();)

answered Jan 30 '15 at 08:40

M Oehm

28,726
3
31
42

2

What would be the benefit of the macros `THREE` and `COUNT`? – harper Jan 30 '15 at 08:45
1

@harper: If you have many loops to unroll that way and all use the same repetition count, you could have one global definition. When you tune your performance, you just need to change the count in one place. That place might even be external, i.e. compiler options or a Makefile. But mainly I've incorporated it, because the OP used this set-up in his example. – M Oehm Jan 30 '15 at 08:50
Ah, I see. It has nothing to do with the particular macros, but one should always reduce use of nuneric literals and replace them with symbols. – harper Jan 30 '15 at 09:33
You use numerous expansions. Two expansions are sufficient: `#define M_REPEAT_(N, X) M_REPEAT ## N(X)` + `#define M_REPEAT(N, X) M_REPEAT_(N, X)` if you want to use another macro as counter. – Knut Apr 28 '16 at 08:51

Persixty · Answer 3 · 2015-01-30T08:51:16.853

There's no standard way of doing this.

Here's a slightly bonkers approach:

#define DO_THING printf("Shake it, Baby\n")
#define DO_THING_2 DO_THING; DO_THING
#define DO_THING_4 DO_THING_2; DO_THING_2
#define DO_THING_8 DO_THING_4; DO_THING_4
#define DO_THING_16 DO_THING_8; DO_THING_8
//And so on. Max loop size increases exponentially. But so does code size if you use them. 

void do_thing_25_times(void){
    //Binary for 25 is 11001
    DO_THING_16;//ONE
    DO_THING_8;//ONE
    //ZERO
    //ZERO
    DO_THING;//ONE
}

It's not too much to ask of an optimizer to eliminate dead code. In which case:

#define DO_THING_N(N) if(((N)&1)!=0){DO_THING;}\
    if(((N)&2)!=0){DO_THING_2;}\
    if(((N)&4)!=0){DO_THING_4;}\
    if(((N)&8)!=0){DO_THING_8;}\
    if(((N)&16)!=0){DO_THING_16;}

score 3 · Answer 4 · answered Jan 30 '15 at 08:42

You can't use a #define construct to calculate the "unroll-count". But with sufficient macros you can define this:

#define LOOP1(a) a
#define LOOP2(a) a LOOP1(a)
#define LOOP3(a) a LOOP2(a)

#define LOOPN(n,a) LOOP##n(a)

int main(void)
{
    LOOPN(3,printf("hello,world"););
}

_{Tested with VC2012}

score 0 · Answer 5 · answered Jan 30 '15 at 08:37

You can't write real recursive statements with macros and I'm pretty sure you can't have real iteration in macros as well.

However you can take a look at Order. Although it is entirely built atop the C preprocessor it "implements" iteration-like functionalities. It actually can have up-to-N iterations, where N is some large number. I'm guessing it's similar for "recursive" macros. Any way, it is such a borderline case that few compilers support it (GCC is one of them, though).

Self-unrolling macro loop in C/C++

5 Answers5

Linked

Related