8

I don't have much experience with how compilers actually optimise, or with the difference between the various levels (-O2 vs -O3 for gcc, for example). As such, I'm unsure whether the following two statements are equivalent for an arbitrary compiler:

for(i=0;i<10;++i){
variable1*variable2*gridpoint[i];
}

and

variable3=variable1*variable2;
for(i=0;i<10;++i){
variable3*gridpoint[i];
}

From a processing-time point of view it would make sense to compute the product of variable1 and variable2 only once, as they don't change in the loop. This requires extra memory however, and I'm not sure how strongly an optimiser factors this overhead in. The first expression is the easiest to read if you have an equation from a paper/book and want to translate it into something computer-readable, but the second might be the fastest, especially for more complicated equations with a lot of unchanged variables inside the loop (I have some pretty nasty non-linear differential equations which I would like to keep human-readable in the code). Does any of this change if I declare my variables as constants? I hope my question makes sense for an arbitrary compiler, since I use gcc, Intel and Portland compilers.

user787267
  • 2
    This is called "common subexpression elimination". – Cat Plus Plus Jan 09 '13 at 16:43
  • 6
    In this case, unless a variable is marked volatile, the whole loop will be eliminated as there are no side effects in the expression in the body of the loop. – Jonathan Leffler Jan 09 '13 at 16:45
  • 2
    Or dead code elimination, as both of the snippets above can be safely eliminated from a program since they do nothing :) –  Jan 09 '13 at 16:45
  • 1
    For the record, a `volatile` variable is a variable that is associated with a region of memory space that can change outside of the normal flow of the code... for instance, a variable that maps to an input register for a microprocessor, or a variable that maps to a clock register. Thus, each time you access it, it may have changed, even though no code could have assigned a value to it. Thus, a compiler won't factor it out to the outside of a loop. – Dancrumb Jan 09 '13 at 16:56
  • @Dancrumb Thanks for your above definition of volatile. This helped me avoid going for another search for the significance of volatile keyword. :) – sr01853 Jan 09 '13 at 17:29
  • @user - Just to show what compilers do, look at [this other answer](http://stackoverflow.com/a/11639305/597607) where 10 lines of inlined function calls and templates turn into just 4 or 5 machine instructions. A compiler can optimize loops in its sleep! – Bo Persson Jan 09 '13 at 17:36
  • 2
    The answer is that the compiler "may" do any of these optimizations but is not required to and may not do any of them for various reasons. Instead of assuming that something will be optimized out, write the best code that you can. – Variable Length Coder Jan 09 '13 at 18:18

3 Answers

4

It is hard to answer this question adequately for an arbitrary compiler. What can be done with this code depends not only on the compiler, but also on the target architecture. I'll try to explain what a production compiler with good capabilities could do to this code.

From a processing-time point of view it would make sense to only compute the product of variable1 and variable2 once as they don't change in the loop.

You are right. And as Mr. Cat has pointed out, this is called common subexpression elimination. Thus, the compiler may generate code that computes the expression only once (or even evaluates it at compile time if the values of both operands are known to be constant).

A decent compiler may also perform subexpression elimination on function calls if it can determine that the functions have no side effects. GCC, for example, can analyze a function if its body is available, but there are also the pure and const attributes that can be used to explicitly mark functions that should be subject to this optimization (see Function Attributes).
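
For illustration, here is a minimal sketch of how those attributes can be used; the function names are made up for this example:

/* `const` promises that the result depends only on the arguments and that
   the function has no side effects, so the compiler may hoist the call out
   of the loop. (Use `pure` instead if the function also reads global memory.) */
__attribute__((const)) static double scale_factor(double a, double b)
{
    return a * b;
}

double weighted_sum(const double *gridpoint, double a, double b)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 10; ++i)
        sum += scale_factor(a, b) * gridpoint[i];  /* call can be hoisted */
    return sum;
}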

Given that there are no side effects and the compiler is able to determine that (in your example, there is nothing standing in the way), the two snippets are equivalent in this regard (I've checked with clang :-)).

This requires extra memory however, and I'm not sure how strongly an optimiser factors this overhead in.

In fact, this doesn't require any extra memory. The multiplication is done in processor registers and the result is stored in a register as well. It is a matter of eliminating a lot of code and using a single register to hold the result, which is always a win (and it most certainly makes life easier when it comes to register allocation, especially in a loop). So if this optimization can be done, it will be done at no extra cost.

The first expression is the easiest to read...

Both GCC and Clang will perform this optimization. I am not sure about other compilers, though, so you will have to check for yourself. But it is hard to imagine any good compiler that does not do subexpression elimination.

Does any of this change if I declare my variables as constants?

It may. This is called a constant expression: an expression that contains only constants. A constant expression can be evaluated during compilation rather than at run time. So, for example, if you multiply A, B and C where both A and B are constants, the compiler will pre-compute the A*B expression and only multiply C by that pre-computed value. Compilers may also do this with non-constant values if they can determine their values at compile time and be sure that they do not change. For example:

$ cat test.c
inline int foo(int a, int b)
{
    return a * b;
}

int main() {
    int a;
    int b;
    a = 1;
    b = 2;
    return foo(a, b);
}
$ clang -Wall -pedantic -O4 -o test ./test.c
$ otool -tv ./test
./test:
(__TEXT,__text) section
_main:
0000000100000f70    movl    $0x00000002,%eax
0000000100000f75    ret

There are also other optimizations that can take place for the above snippets. Below are some that come to mind:

The first and most obvious is loop unrolling. Since the number of iterations is known at compile time, the compiler may decide to unroll the loop. Whether this optimization is applied or not depends on the target architecture (i.e. some CPUs can "lock onto your loop" and execute the rolled code faster than its unrolled version, which also keeps the code more cache friendly by using less space, avoiding extra µOP fusion stages, etc.).
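
Just for illustration, here is a rough sketch of what a 2-way unrolled version of the loop might look like, using a hypothetical `sum` accumulator so the loop actually does something; the compiler performs this transformation itself, you would not normally write it by hand:

double unrolled(const double *gridpoint, double variable3)
{
    double sum = 0.0;
    int i;
    /* two iterations of the original loop per pass */
    for (i = 0; i + 1 < 10; i += 2) {
        sum += variable3 * gridpoint[i];
        sum += variable3 * gridpoint[i + 1];
    }
    /* remainder iteration, in case the trip count had been odd */
    for (; i < 10; ++i)
        sum += variable3 * gridpoint[i];
    return sum;
}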

The second optimization, which can literally speed things up by something like 50 times, is the use of SIMD instructions (SSE, AVX, etc.). GCC, for example, is very good at it (Intel's compiler must be too, if not better). I have verified that the following function:

uint8_t dumb_checksum(const uint8_t *p, size_t size)
{
    uint8_t s = 0;
    size_t i;
    for (i = 0; i < size; ++i)
        s = (uint8_t)(s + p[i]);
    return s;
}

... is transformed into a loop where each step sums 16 values at once (i.e. using _mm_add_epi8), with additional code handling alignment and the leftover (<16) iterations. Clang, however, had totally failed at this the last time I checked. So GCC may vectorize your loop that way too, even if the number of iterations is not known in advance.
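
Roughly, the vectorized loop behaves like the following hand-written SSE2 sketch (this illustrates the idea only, not the exact code the compiler emits):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

uint8_t simd_checksum_sketch(const uint8_t *p, size_t size)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;

    /* sum 16 bytes per iteration */
    for (; i + 16 <= size; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        acc = _mm_add_epi8(acc, v);
    }

    /* fold the 16 lanes into a single byte */
    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint8_t s = 0;
    for (int j = 0; j < 16; ++j)
        s = (uint8_t)(s + lanes[j]);

    /* scalar tail for the leftover (<16) bytes */
    for (; i < size; ++i)
        s = (uint8_t)(s + p[i]);
    return s;
}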

And if I may, I'd like to suggest that you do not optimize your code unless you find it to be a bottleneck. Otherwise you may waste a hell of a lot of time on false and premature optimization.

I hope this answers your questions. Good Luck!

  • "This is called a constant expression" I think the OP rather meant if (s)he declared `variable1` etc. `const`-qualified. – Daniel Fischer Jan 09 '13 at 20:23
  • Thank you for this very instructive answer. I did, however, mean declaring the variables as "const float variable1" etc, so not (necessarily) compile-time constants. Can I safely assume that cleaner notation will also be optimised? For example defining "const float a=NastyObject->TediousVariableName" and using the variable a instead of the tedious variable name? I can't see why the compiler wouldn't recognize this as a simple alias – user787267 Jan 10 '13 at 09:43
  • @user787267: Yes, this could also help the optimizer. The reason is that sometimes the compiler cannot determine that there are no side effects. For instance, if you use `obj->value` and do something else (i.e. call some function and pass `obj` into it), the compiler will not know whether the value changes or not, and will always read it from memory, whereas if you explicitly do `const type var = obj->var;`, it can keep this value in a register instead of reading memory, and do other optimizations. If however there are no side effects and the compiler can determine that, then it doesn't matter. –  Jan 10 '13 at 13:54
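
For illustration, a minimal sketch of the pattern described in that last comment; the struct, field, and function names are hypothetical:

struct object { float tedious_variable_name; };
void do_something(struct object *obj);  /* might or might not modify obj */

float accumulate(struct object *obj, const float *gridpoint)
{
    /* read the member once; `a` is guaranteed not to change even if obj does */
    const float a = obj->tedious_variable_name;
    float sum = 0.0f;
    int i;
    for (i = 0; i < 10; ++i) {
        do_something(obj);
        sum += a * gridpoint[i];  /* no re-load of obj->tedious_variable_name */
    }
    return sum;
}
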
3

Yes, you can count on compilers to do a good job of performing subexpression elimination, even through loops. This can result in a slight increase in memory usage, but all of that will be considered by any decent compiler, and it's pretty much always a win to perform subexpression elimination (since the memory we're talking about is registers and L1 cache).

Here are some quick tests to "prove" it to yourself. The results indicate that you should basically not try to outsmart the compiler with manual subexpression elimination; just code naturally and let the compiler do what it's good at (which is stuff like figuring out which expressions should really be eliminated and which shouldn't, given the target architecture and surrounding code).

Later, if you are unsatisfied with the performance of your code, you should take a profiler to it and see which statements and expressions are eating up the most time, and then try to figure out whether you can reorganize the code to help the compiler out. I'd say the vast majority of the time it won't be simple things like this; it'll be things that reduce cache stalls (i.e. organizing your data better), eliminate redundant inter-procedural calculations, and so on.

(FTR the use of randoms in the following code just ensures the compiler can't get too zealous about variable elimination and loop unrolling)

prog1:

#include <stdlib.h>
#include <time.h>

int main () {
    srandom(time(NULL));
    int i, ret = 0, a = random(), b = random(), values[10];
    int loop_end = random() % 5 + 1000000000;
    for (i=0; i < 10; ++i) { values[i] = random(); }

    for (i = 0; i < loop_end; ++i) {
        ret += a * b * values[i % 10];
    }

    return ret;
}

prog2:

#include <stdlib.h>
#include <time.h>

int main () {
    srandom(time(NULL));
    int i, ret = 0, a = random(), b = random(), values[10];
    int loop_end = random() % 5 + 1000000000;
    for (i=0; i < 10; ++i) { values[i] = random(); }

    int c = a * b;
    for (i = 0; i < loop_end; ++i) {
        ret += c * values[i % 10];
    }

    return ret;
}

And here are the results:

> gcc -O2 prog1.c -o prog1; time ./prog1  
./prog1  1.62s user 0.00s system 99% cpu 1.630 total

> gcc -O2 prog2.c -o prog2; time ./prog2
./prog2  1.63s user 0.00s system 99% cpu 1.636 total

(This is measuring wall time, so don't pay attention to the 0.01 second difference; running them a few times, they both land in the 1.62-1.63 second range, so they're the same speed.)

Interestingly enough, prog1 was faster when compiled without optimization:

> gcc -O0 prog1.c -o prog1; time ./prog1  
./prog1  2.83s user 0.00s system 99% cpu 2.846 total

> gcc -O0 prog2.c -o prog2; time ./prog2 
./prog2  2.93s user 0.00s system 99% cpu 2.946 total

Also interesting: compiling with -O1 provided the best performance.

> gcc -O1 prog1.c -o prog1; time ./prog1 
./prog1  1.57s user 0.00s system 99% cpu 1.579 total

> gcc -O1 prog2.c -o prog2; time ./prog2
./prog2  1.56s user 0.00s system 99% cpu 1.563 total

GCC and Intel are great compilers and are pretty smart about handling stuff like this. I don't have any experience with the Portland compiler, but these are pretty basic things for a compiler to do, so I'd be very surprised if it didn't handle situations like this well.

hexist
  • What is the significance of these switches O2, O1, O0? – sr01853 Jan 09 '13 at 17:36
  • Those are the optimization flags for GCC. -O0 performs no major optimizations, -O1 does some optimizing, -O2 does even more optimizations, etc. – hexist Jan 09 '13 at 17:41
  • Is it not always good to use O2 optimization? What are the cases when O0 optimization can be useful? Thanks. – sr01853 Jan 09 '13 at 17:53
  • 2
    O0 compiles the fastest and is the easiest to debug. O2 can end up optimizing out variable definitions and inlining code, so when viewed with a debugger you sometimes won't be able to inspect the values of some variables and that sort of thing. O0 doesn't do any of that. – hexist Jan 09 '13 at 18:06
  • Thank you for your answer. Any guesses as to why -O1 is the fastest in this case? And why prog1 is the fastest without optimisation? – user787267 Jan 10 '13 at 09:14
  • We'd have to dig through the output assembly to really figure out what's going on with O1 vs O2 and our O0 case. It is known, though, that sometimes optimizations can have a detrimental effect, but most of the time they do more good than harm (we hope!). If I had to guess, it is probably something subtle like an instruction-ordering effect that the CPU was able to take advantage of; it's a very minor difference in speed. Don't take that as a rule of thumb, by the way; in production code O2 or O3 will still likely produce better code, but you may want to check :). – hexist Jan 10 '13 at 15:18
0

If I were a compiler, I would recognize that both of those loops have no left-hand operands and no side effects at all (other than setting i to 10), so I'd just optimize the loops out completely.

I'm not saying this actually happens; it just looks like it could happen from the code you provided.
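
For comparison, here is a small self-contained variant in which the result is actually used (the `sum` accumulator and the printing are made up for this example), so the loop cannot be removed as dead code:

#include <stdio.h>

int main(void)
{
    double gridpoint[10];
    double variable1 = 2.0, variable2 = 3.0, sum = 0.0;
    int i;

    for (i = 0; i < 10; ++i)
        gridpoint[i] = i;

    /* the accumulated value is printed below, so the loop has an
       observable effect and must be kept (in some optimized form) */
    for (i = 0; i < 10; ++i)
        sum += variable1 * variable2 * gridpoint[i];

    printf("%f\n", sum);
    return 0;
}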

abelenky
  • 1
    It certainly does happen :) But I think he was just trying to provide a simple example... – hexist Jan 09 '13 at 17:44
  • I was indeed just trying to make a simple example. I guess it would have made a bit more sense if I actually had a left hand side inside the loops. – user787267 Jan 10 '13 at 09:46