
Why is the following:

for (auto & x : vec) { /* do stuff with x */ } 

faster than

for (int i = 0; i < v.size(); ++i) { /* do stuff with v[i] */ }

As my title says, I was told that incrementing and dereferencing the iterator is faster than incrementing a separate variable and indexing into the array, but I don't understand why.

What exactly is happening that makes it faster?

I tried to step through and see what was happening, and this is what I came up with:

// pseudo instructions
// iterator

increment iterator
    e.g., if the object is 3 integers you would increment the iterator
    by 12
dereference this iterator

// pseudo instructions
// array

increment array index
    add 1 to index counter
multiply by size of object 
    e.g., if the object is 3 integers, you multiply by 12
dereference the array

Iterating through an array seems rather trivial. You are just doing the following:

[offset + index * size] ; An array takes this form in assembly

Which would look something like this if you have an array of integers

[rbp - 16 + rdi * 4] ; rbp - 16 is the offset
                     ; rdi is the index
                     ; 4 is the size of the integer

The array iteration seems rather trivial, so I don't understand how dereferencing an iterator is faster.

Peter Cordes
Happy Jerry
  • "_I was told_": As a general statement that is almost surely wrong. Have you actually tested it for some example? – user17732522 May 25 '22 at 04:21
  • Wondering how you determined it was faster. – Retired Ninja May 25 '22 at 04:23
  • The reason to prefer `for (auto & x : vec)` is not speed, but correctness and readability. For example `int` is the wrong type: the vector size could be larger than fits in an `int`, and then `for (int i = 0; i < v.size(); ++i)` has undefined behavior. – user17732522 May 25 '22 at 04:24
  • One thing might be that `v.size()` must be called repeatedly if the compiler cannot prove it does not change. Although I would imagine this case is focused on by the compiler writers as it is so common. Other than that, write readable code first and for-ranges more clearly state the intent. – Quimby May 25 '22 at 04:25
  • Range-based for loops improve correctness (no risk of going out of bounds), and that might be more important than raw performance, even if the compiler will do a good job on this. Don't go by ear; test, and read the C++ Core Guidelines. For you, this part: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-for-range – Pepijn Kramer May 25 '22 at 04:50
  • Array iterators are just pointers, that's why they're the same speed. (If you enable optimization and the compiler is able to widen the `int` to `intptr_t` so it doesn't have to redo sign-extension on every dereference. That will be the case here with good compilers like GCC and clang, due to signed-overflow UB. [What Every C Programmer Should Know About Undefined Behavior](http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html)) Notice that your example asm uses RDI, not EDI. Look at actual compiler output on https://godbolt.org/ instead of guessing. – Peter Cordes May 25 '22 at 05:00
  • @PeterCordes Is RDI not used when indexing arrays in assembly? I had assumed that on x86-64 machines RDI would be used. – Happy Jerry May 26 '22 at 02:44
  • Yes, you would want to use 64-bit RDI in the asm, even though you declared `i` as a 32-bit integer type (`int`). That's the whole point of my last comment: the compiler does *not* have to implement the behaviour of 32-bit wrap-around in case the `size_t` `v.size()` is larger than `INT_MAX`. Unless you compile with `-fwrapv`, in which case that's possible: the loop could be infinite, doing stuff with `v[INT_MIN]` to `v[INT_MAX]` (e.g. if `v` was a pointer to the middle of a 4 GiB array of bytes). – Peter Cordes May 26 '22 at 02:53

2 Answers


The only way it MAY be faster, at least in unoptimized code, is that you call the size() member function at the beginning of every iteration, and its cost really depends on the container and the type of iterator it uses. Otherwise, using iterators for an array-like container is the same as using pointer arithmetic, and which form is faster depends on optimization. It would also help if i were of an appropriate type such as size_t, so the compiler isn't confused by values of different widths.
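
As a minimal sketch of the two shapes being compared (my own example; process() is a hypothetical stand-in for "do stuff"): with optimization enabled, a decent compiler typically turns both of these into the same pointer-walking loop over a std::vector.

#include <cstddef>
#include <vector>

inline void process(int &x) { x += 1; }  // hypothetical stand-in for the loop body

void by_index(std::vector<int> &v) {
    // std::size_t matches what v.size() returns, so there is no width mismatch
    for (std::size_t i = 0; i < v.size(); ++i) {
        process(v[i]);
    }
}

void by_iterator(std::vector<int> &v) {
    // a std::vector iterator behaves like (and often is) a plain pointer
    for (auto it = v.begin(); it != v.end(); ++it) {
        process(*it);
    }
}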

The range-based for loop has more invalidation issues than the index-based one, though, if you are modifying the same container while iterating, because it is specified this way:

The range-based for statement

for ( for-range-declaration : for-range-initializer ) statement

is equivalent to

{
    auto &&__range = for-range-initializer ;
    auto __begin = begin-expr ;   
    auto __end = end-expr ;
    for ( ; __begin != __end; ++__begin ) {
        for-range-declaration = *__begin;
        statement
    }
}

As you can see, the iterators for the beginning and the end of the range are evaluated once, before the loop, so any action on the range that invalidates them invokes undefined behaviour.
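
For example (my illustration, not part of the original answer), growing the vector from inside a range-based for loop hits exactly this: a reallocation invalidates the cached __begin/__end, while an index-based loop that re-reads v.size() each pass stays well-defined (it will simply also visit the newly appended elements).

#include <cstddef>
#include <vector>

void appends_while_iterating(std::vector<int> &v) {
    // range-based for: __begin/__end are captured once, before the loop
    for (int &x : v) {
        if (x == 0) {
            v.push_back(42);  // may reallocate; the cached iterators now dangle -> UB
        }
    }
}

void appends_by_index(std::vector<int> &v) {
    // index-based: v.size() is re-read every iteration and indices survive reallocation
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 0) {
            v.push_back(42);
        }
    }
}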

If we talk about the generic idea of a container and not just std::vector, for some containers operator[] may be far more expensive than an iterator increment. Only a few standard containers have constant-cost operator[], and for views/ranges its cost is generally larger and often not constant at all. The cost of incrementing an iterator tends to be constant or, at worst, linear. These costs are usually documented, or can be found out with performance tests.
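
To make that concrete with a container outside the question (my example, not the author's): std::list has no operator[] at all, and emulating indexed access with std::next makes every access linear, so the "indexed" loop becomes quadratic while the iterator/range-for loop stays linear.

#include <cstddef>
#include <iterator>
#include <list>

long sum_indexed(const std::list<int> &l) {
    long sum = 0;
    // std::list has no operator[]; faking one with std::next walks i nodes
    // on every access, making the whole loop O(n^2)
    for (std::size_t i = 0; i < l.size(); ++i) {
        sum += *std::next(l.begin(), static_cast<std::ptrdiff_t>(i));
    }
    return sum;
}

long sum_iterated(const std::list<int> &l) {
    long sum = 0;
    for (int x : l) {  // each step is one O(1) iterator increment: O(n) total
        sum += x;
    }
    return sum;
}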

So the main issue is not performance but correctness: using a range-based for loop signals that the range is not changed inside it. It also makes such a change harder to cause accidentally, because all you have access to is a reference to (or a copy of) an element; an attempt to modify the range would be obvious, since it would have to use the original range directly. It also saves you from the boilerplate of writing begin-expr/end-expr yourself.

In a common for loop, calling size() in i < v.size() may suggest that size() could change during execution. If that is true and it happens in the middle of the loop body, you may find that the index is out of bounds from that point on.
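
A hypothetical sketch of that failure mode (maybe_shrink() is my invention, not from the answer): a call inside the body shrinks the vector, and an index that was in range at the top of the iteration is out of bounds by the time it is used again.

#include <cstddef>
#include <vector>

// hypothetical helper that may shrink the vector as a side effect
void maybe_shrink(std::vector<int> &v) {
    if (!v.empty() && v.back() < 0) {
        v.pop_back();
    }
}

void risky(std::vector<int> &v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        maybe_shrink(v);  // the size may have just shrunk...
        v[i] += 1;        // ...so if i was the last valid index, this access is out of bounds
    }
}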

When reviewing code whose author I don't know, if I am looking for the source of such a change, every loop of this form is a suspect for a detected bug or an out-of-bounds crash from the moment I see its first line followed by non-transparent code.

Swift - Friday Pie

Is for (auto & x : vec) { /* do stuff with x */ } faster than for (int i = 0; i < v.size(); ++i) { /* do stuff with v[i] */ }?

When you talk about faster in absolutes you are basically always wrong. Every case is different. A simple change in the stack alignment can make as much as a 40% difference in the speed of your code, which is more than the difference between -O0 and -O2 in general. So simply adding a local variable to some function earlier in the call graph can completely reverse what is faster.

Always measure the performance of your code with real inputs!
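
For instance, here is a very rough timing sketch (my own, not from the answer; do_stuff() is a placeholder). A naive timer like this is easily fooled by the optimizer, warm-up, and alignment effects, so treat it as a starting point rather than a real benchmark.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

static void do_stuff(int &x) { x += 1; }  // hypothetical workload

int main() {
    std::vector<int> v(10'000'000, 1);

    auto t0 = std::chrono::steady_clock::now();
    for (auto &x : v) { do_stuff(x); }
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < v.size(); ++i) { do_stuff(v[i]); }
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto d) {
        return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    std::printf("range-for: %lld us\nindexed:   %lld us\n",
                static_cast<long long>(us(t1 - t0)), static_cast<long long>(us(t2 - t1)));
}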

That said, the indexed for loop invokes v.size() in every iteration; the range-based loop does not. The compiler may or may not be able to prove that the size of v never changes for the duration of the loop. If it can't, that will cause slowdowns.
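
If you suspect the compiler can't prove that, one workaround (my sketch, not part of the answer; process() is hypothetical) is to hoist the size into a local so it is obviously loop-invariant:

#include <cstddef>
#include <vector>

inline void process(int &x) { x += 1; }  // hypothetical loop body

void loop(std::vector<int> &v) {
    const std::size_t n = v.size();  // read once; fine as long as the body doesn't resize v
    for (std::size_t i = 0; i < n; ++i) {
        process(v[i]);
    }
}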

But if it can, both ways of writing the loop should turn into this (on x86_64):

size_t count = vec.end() - vec.begin();
T *it = vec.begin();
while (count-- > 0) {
    /* do stuff with *it */
    ++it;
}

That's usually the form a well-optimized loop takes. More specifically, in asm it becomes a do{}while(--count != 0), with code outside the loop so it isn't entered at all when it should run zero iterations. Or a do{}while(++ptr < endptr), using the pointer itself as the loop counter and avoiding having to calculate a count via pointer subtraction.
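
Spelled out in C++ (my sketch of the shapes described above, with a hypothetical process() as the body), those two structures look roughly like this:

#include <cstddef>
#include <vector>

inline void process(int &x) { x += 1; }  // hypothetical loop body

void counted_form(std::vector<int> &v) {
    std::size_t count = v.size();
    int *it = v.data();
    if (count != 0) {               // entry check hoisted out of the loop
        do {
            process(*it);
            ++it;
        } while (--count != 0);     // the do{}while(--count != 0) shape
    }
}

void pointer_form(std::vector<int> &v) {
    int *p = v.data();
    int *end = p + v.size();
    if (p < end) {                  // skip the loop entirely for an empty vector
        do {
            process(*p);
        } while (++p < end);        // the pointer itself is the loop counter
    }
}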

Your idea that array access should take the form [rbp - 16 + rdi * 4] is false. While that is perfectly reasonable to write by hand, the resulting machine code is larger than using [rdi] and an add rdi, 4 once per loop. This saves instruction cache, especially if the pointer is accessed multiple times, and ends up being faster. The > 0 test is also simpler than < end when you deal with an index, so if the compiler uses a separate count register it can do dec rcx / jnz .loop_top (1 uop on Intel) instead of inc / cmp / jne (which needs a 2nd register for the limit and is 2 uops on Intel and AMD, even with the cmp+jcc macro-fused).

Overall, if one of the two forms is faster than the other, it is either a failure to optimize both the same way or the v.size() call being impossible to eliminate.

Peter Cordes
Goswin von Brederlow