Performance of for-loops with same index

Question

while coding I came across a question:

When I have to use a lot of for loops, all iterating over a different span. Is the performance (i.e. runtime) better if I just declare one variable as an index (Example I) or does it not matter at all (Example II)?

Example I:

int ind;
for(ind=0; ind < a; ind++) { /*do something*/ }
for(ind=0; ind < b; ind++) { /*to something*/ }
...
for(ind=0; ind < z; ind++) { /*to something*/ }

Example II:

for(int ind=0; ind < a; ind++) { /*do something*/ }
...
for(int ind=0; ind < z; ind++) { /*do something*/ }

Thank you for your help

If there *is* a difference (which I doubt), you'll never be able to measure it. — Roger Rowland, Mar 29 '16 at 11:28
***If*** there is a difference, it won't be measurable, but most likely the compiler should be able to generate the same code for both examples. — Some programmer dude, Mar 29 '16 at 11:29
Just think about how small any extra amount of work is compared to your `//do something`. `ind` is declared only once, but your loop runs many times. I doubt you can find any realistic example where you could measure any difference. — 463035818_is_not_an_ai, Mar 29 '16 at 11:31
Looking at the generated assembly code will tell you that the two are identical. Or that one has some additional instructions that are not present in the other. But if they are just "different", how will you tell which is faster? — Martin Bonner supports Monica, Mar 29 '16 at 11:32
I doubt if there is any difference. I would imagine a single stack allocation is made for all the local variables at the same time at function start. — Galik, Mar 29 '16 at 11:33
If you are so concerned about performance you should use pre-increment. — hlscalon, Mar 29 '16 at 11:38
@old_mountain: If she were so concerned about performance she'd make sure that the loops are merged (either the optimizer or by hand). — Benjamin Bannier, Mar 29 '16 at 11:42
@old_mountain. This only makes sense if `int` were a complex class. — Chiel, Mar 29 '16 at 12:13
@BenjaminBannier: The loop bounds are different in each loop. Also, it's sometimes better to split loops, and sometimes better to merge loops. It depends on what you're doing inside them. e.g. if one loop uses a LUT that barely fits in L1 cache, it's probably better not to fuse it with another memory-intensive loop. — Peter Cordes, Mar 29 '16 at 12:43

Richard Hodges · Accepted Answer · 2016-03-29T11:49:17.763

If you're enabling optimisations (and if you don't, any discussion about performance is moot) then it's not possible to reason about what the compiler will do in the two scenarios.

The answer will depend upon:

The toolchain
The version of the toolchain
What options the toolchain was built with
what's happening inside the loop
(related) whether the loop can be unrolled
(related) whether the loop actually needs the index (if you're just indexing into arrays, all mention of i will usually be optimised away).
...etc

Here's how to write fast code:

Write elegant code that succinctly expresses your intent.
Check that your code is elegant and that it succinctly expresses your intent.
Remove the bugs and go back to 2
Enable the optimiser.
(this bit's important) wait for users to complain that your code is too slow.
If 5 didn't happen, stop.
measure where the most time is being spent and fix that. It won't be your loop counters, I can promise you that.

For the record, you should write it this way:

for(int ind=0; ind < a; ++ind)

Because that's more elegant (scope of ind is limited), less likely to be buggy, uses pre-increment for ind (better performance if ind ever happens to become a class type) and expresses intent (ind is used for this loop).

I noticed - we crossed in the aether. Personally though, I prefer expressing the condition in the loop as `ind < a` - compilers tend not to like comparing types against values :-) — Martin Bonner supports Monica, Mar 29 '16 at 11:44

sergej · Answer 2 · 2016-03-29T12:08:31.390

4

In practice, what matters is the number of iterations and the do something complexity, not the way the index variable is defined.

Also, consider the Rules Of Optimization.

don't optimize

don't optimize yet

profile before optimizing

edited Mar 29 '16 at 12:08

answered Mar 29 '16 at 11:35

sergej

17,147
6
52
89

Lundin · Answer 3 · 2016-03-29T12:50:53.447

In ancient times when dinosaurs walked the earth, there might have been something like: "at the point when the compiler encounters a local variable declaration, allocate space for it on the stack".

This could perhaps be the rationale why ancient dinosaur C only allowed variable declarations at the top of a block: ancient dinosaur compilers needed to know all variables in advance, before generating the code.

Then somewhere around the 80s, optimizing compilers started to allocate space for the variable at the point where it was first used. Regardless of where that variable was actually declared. Not only would this reduce stack peak usage, it would also mean that the variable didn't need to be allocated at all, if the function didn't use it. Some compilers would even go crazy effective and allocate the variable inside a CPU register instead of putting it on the stack!

And since then, that's how every compiler works. So unless you have stolen a compiler from some museum, this shouldn't be something you need to ponder.

Most likely your loop iterator will be allocated in a CPU register in both examples. I would call a compiler that generated slower code for either case broken. At worst, I suppose some compilers might get a bit confused over the different variable names and perhaps use different CPU registers for every loop - which would make the disassembled C code confusing to read but will not have any impact on performance.

As others have already mentioned, the best practice is to reduce the scope of every variable as much as possible, so you should use for(int ind=0; .... This has nothing to do with efficiency, but rather readability, maintainability, avoiding unnecessary namespace pollution and so on. The only case where you need to declared the loop iterator before the loop, is when you need to keep the value after the loop has ended.

As I understand it, one reason compilers won't make code that releases a big local array as soon as it goes out of scope is that debug formats depend on knowing how big a stack frame each function makes. I read once that one use-case for `__attribute__((noinline))` in Linux (the kernel) was to avoid reserving space on the stack (and even saving/restoring some registers) when the fast path non-error case didn't need any of that. So there are cases where a compiler could make better code by not saving some regs until after a branch, but I guess it's hard to choose when that would be good. — Peter Cordes, Mar 29 '16 at 12:49

score 1 · Answer 4 · answered Mar 29 '16 at 11:29

1

The only way to tell if something makes a difference is to measure (and be aware that the answer may differ from compiler to compiler and platform to platform).

My instinct is that compilers will generate identical code for the two samples though.

answered Mar 29 '16 at 11:29

Martin Bonner supports Monica

28,528
3
51
88

score 1 · Answer 5 · edited May 23 '17 at 11:51

First of all, ind is so close to int that you typoed it in the question, so it's a bad choice of variable name. Using i as a loop index is a near-universal convention.

Any decent compiler will do lifetime analysis on the int i that's in scope for the whole function and see that the i=0 at the start disconnects it from its previous value. Uses of i after that are unrelated to uses before that, because of the unconditional assignment that didn't depend on anything computed from the previous value.

So from an optimizing compiler's perspective, there shouldn't be a difference. Any difference in the actual asm output should be considered a missed-optimization bug in whichever one is worse.

In practice, gcc 5.3 -O3 -march=haswell targeting x86-64 makes identical loops for narrow scope vs. function scope in a simple test I did. I had to use three arrays inside the loop to get gcc to use indexed addressing modes instead of incrementing pointers, which is good because one-register addressing modes are more efficient on Intel SnB-family CPUs.

It reuses the same register for i in both loops, instead of saving/restoring another call-preserved register (e.g. r15). Thus, we can see this potential worry about more variables in a function leading to worse register allocation is not in fact a problem. gcc does a pretty good job most of the time.

These are the two functions I tested on godbolt (see the link above). They both compile to identical asm with gcc 5.3 -O3.

#include <unistd.h>
// int dup(int) is a function that the compiler won't have a built-in for
// it's convenient for looking at code with function calls.

void single_var_call(int *restrict dst, const int *restrict srcA,
                     const int *restrict srcB, int a) {
    int i;
    for(i=0; i < a; i++) { dst[i] = dup(srcA[i] + srcB[i]); }
    for(i=0; i < a; i++) { dst[i] = dup(srcA[i]) + srcB[i]+2; }
}

// Even with restrict, gcc doesn't fuse these loops together and skip the first store
// I guess it can't because the called function could have a reference to dst and look at it
void smaller_scopes_call(int *restrict dst, const int *restrict srcA,
                         const int *restrict srcB, int a) {
    for(int i=0; i < a; i++) { dst[i] = dup(srcA[i] + srcB[i]); }
    for(int i=0; i < a; i++) { dst[i] = dup(srcA[i]) + srcB[i]+2; }
}

For correctness / readability reasons: prefer `for (int i=...)`

The C++ / C99 style of limiting the scope of loop variables has advantages for humans working on the code. You can see right away that the loop counter isn't used outside the loop. (So can the compiler).

It's a good way to prevent errors like initializing the wrong variable.

Performance of for-loops with same index

5 Answers5

For correctness / readability reasons: prefer for (int i=...)

For correctness / readability reasons: prefer `for (int i=...)`