1

I give the following example to illustrate my question:

void fun(int i, float *pt)
{
    // do something based on i
    std::cout<<*(pt+i)<<std::endl;
}
const unsigned int LOOP = 2000000007;

void fun_without_optimization()
{
    float *example;
    example = new float [LOOP];
    for(unsigned int i=0; i<LOOP; i++)
    {
        fun(i,example);            
    }
    delete []example;
}
void fun_with_optimization()
{
    float *example;
    example = new float [LOOP];
    unsigned int unit_loop = LOOP/10;
    unsigned int left_loop = LOOP%10;
    float *pt = example;
    for(unsigned int i=0; i<unit_loop; i++)
    {
        fun(0,pt); 
        fun(1,pt);  
        fun(2,pt);
        fun(3,pt);  
        fun(4,pt);
        fun(5,pt);
        fun(6,pt);  
        fun(7,pt);
        fun(8,pt);  
        fun(9,pt);
        pt=pt+10;
    }
    delete []example;
}

As far as I understand, fun_without_optimization() and fun_with_optimization() should perform the same. The only argument for the second function being better than the first is that the pointer calculation in fun becomes simpler. Are there any other arguments why the second function is better?

feelfree
  • 3
    The second function is an ill-advised attempt at manual loop unrolling. The compiler will unroll this loop for you much more efficiently. – SergeyA Mar 10 '16 at 14:23
  • 2
    This kind of "optimization" is basic enought to be better optimized by any compiler than a human – Mr. E Mar 10 '16 at 14:24
  • 1
    As with most questions like this the best way is to profile the code and see which one is faster. – NathanOliver Mar 10 '16 at 14:24
  • 2
    what do you mean with "pointer calculation in `fun` becomes simple"? Isn't `fun` exactly the same in both cases? – 463035818_is_not_an_ai Mar 10 '16 at 14:27
  • And the two methods are different, as the last 7 floats are not printed in the second snippet. – Jarod42 Mar 10 '16 at 14:37
  • 2
    The second function has more bugs, so it must be better. (*Real* programmers use [Duff's device](https://en.wikipedia.org/wiki/Duff's_device), of course.) – molbdnilo Mar 10 '16 at 14:38
  • The title is weird. You should probably retitle to "Can reducing loop iteration counts ...". Or just "can manual loop unrolling ...". Because reducing times (i.e. time taken) by definition means faster code. :P – Peter Cordes Mar 10 '16 at 16:03

4 Answers

6

Unrolling a loop in which I/O is performed is like moving the landing strip for a B747 from London an inch eastward in JFK.

Peter - Reinstate Monica
  • 1
    Peter, I strongly advise against moving the strip any further east. Might end up in Jamaica Bay. – SergeyA Mar 10 '16 at 14:36
5

Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?

  1. Manually unrolling a loop is error-prone, as your code clearly illustrates: you forgot to process the tail `left_loop`.
  2. Compilers have performed this optimization for you for at least a couple of decades.
  3. How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the length of the assembly instructions in bytes? The compiler might.
  4. Messing with the otherwise clean loop can prevent other optimizations, such as the use of SIMD instructions.

The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.
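To make point 1 concrete, here is a sketch of what handling the forgotten tail looks like. (This uses a simple array sum instead of the question's `fun`, purely so the result is checkable; names like `unrolled_sum` are made up for the illustration.)

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum with a manual unroll by 4, processing the leftover tail explicitly.
// Illustrative sketch only -- a decent compiler does this transformation itself.
double unrolled_sum(const std::vector<double>& v)
{
    const std::size_t unit = v.size() / 4;  // full unrolled blocks
    const std::size_t left = v.size() % 4;  // the tail that must not be forgotten
    double s = 0.0;
    const double* p = v.data();
    for (std::size_t i = 0; i < unit; ++i)
    {
        s += p[0];
        s += p[1];
        s += p[2];
        s += p[3];
        p += 4;
    }
    for (std::size_t i = 0; i < left; ++i)  // the tail loop the question omits
        s += p[i];
    return s;
}
```

Getting this boilerplate right (and keeping it right as the code evolves) is exactly the maintenance cost the compiler spares you.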

Vlad Feinstein
  • 1
    Compilers tend to be bad at using multiple accumulators when unrolling, to create multiple dependency chains to hide latency. This tends to matter in FP loops (where e.g. max FMA throughput on Haswell requires keeping 10 FMAs in flight, since they have 5 cycle latency and one per 0.5c throughput). For FP loops with a loop-carried dependency chain (e.g. sum an array), it can help to manually unroll using multiple vector accumulators. – Peter Cordes Mar 10 '16 at 15:26
  • 1
    @PeterCordes - great! But that just illustrates two of my points: (1) `if you know something that your compiler doesn't` and (2) `you know what you are doing` :) But even then - do you have different builds for Haswell? What if your code will run on another processor? – Vlad Feinstein Mar 10 '16 at 15:35
  • On a different x86 processor, you'll still be limited by FMA throughput. It's possible that some future processors might need even more FMAs in flight than Haswell does, but [Skylake actually reduces FMA latency to 4c](http://agner.org/optimize), so it only needs 8 FP add/mul/FMAs in flight to saturate its FMA unit. And a smart compiler could always restructure things to use even more accumulators. A really smart compiler could even restructure things to use fewer accumulators (e.g. when compiling for 32bit which only has 8 vector regs), but you probably just get crappy code with spills. – Peter Cordes Mar 10 '16 at 15:48
  • Since you have to vectorize yourself (e.g. with AVX intrinsics) to use multiple *vector* accumulators, you need that version of the loop or function in an ifdef anyway. So for a totally different architecture, you'd fall back to the plain-C scalar code (hopefully autovectorized for ARM NEON or PPC altivec or w/e). – Peter Cordes Mar 10 '16 at 15:51
3

The technique you describe is called loop unrolling. It can potentially increase performance, because less time is spent evaluating the control structures (updating the loop variable and checking the termination condition). However, decent compilers can do this for you, and doing it manually decreases the maintainability of the code.
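For example, the clean loop below is the form you would hand to the compiler and ask it to unroll itself, e.g. with `g++ -O2 -funroll-loops` (the flag name is GCC-specific; other toolchains spell it differently):

```cpp
#include <cassert>
#include <cstddef>

// Plain, un-unrolled loop: the form a compiler can unroll on its own,
// e.g. when built with g++ -O2 -funroll-loops.
float sum_plain(const float* p, std::size_t n)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

The compiler picks the unroll factor, handles the tail, and keeps the source readable.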

Codor
1

This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and on how much parallelism the code exposes, multiple instructions can be executed in one cycle.

So this code:

for (int i = 0; i < n; i++) // n multiple of 4, for simplicity
    a += temp[i];           // just a random instruction

Will actually execute faster on a parallel architecture if rewritten as:

for (int i = 0; i < n; i += 4)
{
    temp0 = temp[i]   + temp[i+1]; // reads and additions can be executed in parallel
    temp1 = temp[i+2] + temp[i+3];
    a += temp0 + temp1;
}

There is a limit to how much you can parallelize your code, a limit imposed by the number of physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
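As a self-contained sketch of the multiple-accumulator idea above (function and variable names are made up, and `n` is assumed to be a multiple of 4, as in the snippet):

```cpp
#include <cassert>

// Sum with four independent accumulators, so the additions form four
// parallel dependency chains instead of one serial chain.
// Assumes n is a multiple of 4, as in the answer's example.
double sum4(const double* temp, int n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (int i = 0; i < n; i += 4)
    {
        a0 += temp[i];     // these four additions are independent
        a1 += temp[i + 1]; // and can be executed in parallel
        a2 += temp[i + 2];
        a3 += temp[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Note that with floating point this changes the order of the additions, so the compiler will normally not do it for you without flags like `-ffast-math`.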

It does not stop there: the code you want to optimize has to be a contiguous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.

Writing your code, like:

    for(unsigned int i=0; i<unit_loop; i++)
    {
        fun(0,pt); 
        fun(1,pt);  
        fun(2,pt);
        fun(3,pt);  
        fun(4,pt);
        fun(5,pt);
        fun(6,pt);  
        fun(7,pt);
        fun(8,pt);  
        fun(9,pt);
        pt=pt+10;
    } 

Would not do much unless the compiler inlines the function calls; and it looks like too many instructions anyway...

On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you want to get the maximum optimization out of your code. Remember, the compiler handles 'the general case' while you are likely interested in a particular situation - that's why some compilers have special directives to help with the optimization process.
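One example of such a directive (GCC 8+ syntax; compilers that don't recognize it typically just warn and ignore it):

```cpp
#include <cassert>
#include <cstddef>

float sum_hinted(const float* p, std::size_t n)
{
    float s = 0.0f;
    // Ask the compiler to unroll this loop 4x (GCC 8+ pragma; other
    // compilers have their own spellings, e.g. Clang's #pragma unroll).
    #pragma GCC unroll 4
    for (std::size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}
```

The source stays a clean, maintainable loop; only the hint changes.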

Pandrei
  • I've never heard the claim that loop unrolling was specifically for VLIW CPUs. It still reduces loop overhead on normal CPUs. Compilers for VLIW ISAs need to be good at finding parallelism on their own, and it's true that loop unrolling is a good way to go about it. – Peter Cordes Mar 10 '16 at 15:53
  • @PeterCordes nowadays most architectures are parallel but if you would do the same - loop unrolling - you would probably see a decrease in performance (because of stack spills). Techniques like loop unrolling were first introduced for DSP code optimizations - hence the VLIW reference. – Pandrei Mar 10 '16 at 16:03
  • It depends how you unroll. Some compilers (like clang) like to unroll by [just repeating most of the loop body, like in this case](http://stackoverflow.com/a/35893242/224132), without interleaving the iterations. So it depends on the CPU's OOO window to extract the parallelism. This doesn't increase register pressure much, if at all. Also, https://en.wikipedia.org/wiki/Loop_unrolling says that the technique has been known since 1977 at least (see the dates on the references). That's right around the dates of the earliest DSPs. They weren't VLIW in 1977, were they? – Peter Cordes Mar 10 '16 at 16:13
  • Also, it's safe to assume that your compiler can inline small simple functions (unless you hide the definition in a different compilation unit). Modern compilers are quite good at that, at least gcc/clang are. Proprietary vendor-supplied compilers for uncommon architectures may suck. – Peter Cordes Mar 10 '16 at 16:17