48

Reading various questions here on Stack Overflow about C++ iterators and performance**, I started wondering if for(auto& elem : container) gets "expanded" by the compiler into the best possible version? (Kind of like auto, which the compiler infers into the right type right away and is therefore never slower and sometimes faster).

** For example, does it matter if you write

for(iterator it = container.begin(), eit = container.end(); it != eit; ++it)

or

for(iterator it = container.begin(); it != container.end(); ++it)

for non-invalidating containers?

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
TeaOverflow
  • 2,468
  • 3
  • 28
  • 40
  • 1
    How does using auto instead of typedefs help performance? – Jonathan Wakely May 30 '12 at 18:13
  • @JonathanWakely Sorry, mentioning typedefs was an error on my part. I can't find a proper reference, but what I ment is what Stephen T. Lavavej mentaions at 49:30+ in channel9.msdn.com/Events/GoingNative/GoingNative-2012/STL11-Magic-Secrets – TeaOverflow May 30 '12 at 19:25

5 Answers5

41

The Standard is your friend, see [stmt.ranged]/1

For a range-based for statement of the form

for ( for-range-declaration : expression ) statement

let range-init be equivalent to the expression surrounded by parentheses

( expression )

and for a range-based for statement of the form

for ( for-range-declaration : braced-init-list ) statement

let range-init be equivalent to the braced-init-list. In each case, a range-based for statement is equivalent to

{
  auto && __range = range-init;
  for ( auto __begin = begin-expr,
             __end = end-expr;
        __begin != __end;
        ++__begin )
  {
    for-range-declaration = *__begin;
    statement
  }
}

So yes, the Standard guarantees that the best possible form is achieved.

And for a number of containers, such as vector, it is undefined behavior to modify (insert/erase) them during this iteration.

Community
  • 1
  • 1
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • @Evgeni: modifying a container while looping over it is problematic in general (in many other languages also). There are usually ways to do it correctly, but they are often more difficult than a trivial range-based loop. The same goes here: you have to iterate manually and update the iterator with for example the result from `erase` – KillianDS May 30 '12 at 20:02
  • 1
    How do you conclude that it is guaranteed to be best possible performance? I'm not clear how that conclusion is reached. – thc Mar 08 '19 at 19:59
  • 2
    @thc: I am not saying best possible performance, but best possible form. Specifically, note that (1) `range-init` is only evaluated once, (2) `begin-expr` and `end-expr` are only evaluated once and (3) pre-increment is used. Algorithmitically speaking, this is the form of least complexity. Whether it optimizes to best possible performance is likely, but not guaranteed. Optimizers are fickle. – Matthieu M. Mar 09 '19 at 09:43
28

Out of curiosity I decided to look at the assembly code for both approaches:

int foo1(const std::vector<int>& v) {
    int res = 0;
    for (auto x : v)
        res += x;
    return res;
}

int foo2(const std::vector<int>& v) {
    int res = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
      res += *it;
    return res;
}

And the assembly code (with -O3 and gcc 4.6) is exactly the same for both approaches (code for foo2 is omitted, since it is exactly the same):

080486d4 <foo1(std::vector<int, std::allocator<int> > const&)>:
80486d4:       8b 44 24 04             mov    0x4(%esp),%eax
80486d8:       8b 10                   mov    (%eax),%edx
80486da:       8b 48 04                mov    0x4(%eax),%ecx
80486dd:       b8 00 00 00 00          mov    $0x0,%eax
80486e2:       39 ca                   cmp    %ecx,%edx
80486e4:       74 09                   je     80486ef <foo1(std::vector<int, std::allocator<int> > const&)+0x1b>
80486e6:       03 02                   add    (%edx),%eax
80486e8:       83 c2 04                add    $0x4,%edx
80486eb:       39 d1                   cmp    %edx,%ecx
80486ed:       75 f7                   jne    80486e6 <foo1(std::vector<int, std::allocator<int> > const&)+0x12>
80486ef:       f3 c3                   repz ret 

So, yes, both approaches are the same.

UPDATE: The same observation holds for other containers (or element types) such as vector<string> and map<string, string>. In those cases, it is especially important to use a reference in the ranged-based loop. Otherwise a temporary is created and lots of extra code appears (in the previous examples it was not needed since the vector contained just int values).

For the case of map<string, string> the C++ code snippet used is:

int foo1(const std::map<std::string, std::string>& v) {
    int res = 0;
    for (const auto& x : v) {
        res += (x.first.size() + x.second.size());
    }
    return res;
}

int foo2(const std::map<std::string, std::string>& v) {
    int res = 0;
    for (auto it = v.begin(), end = v.end(); it != end; ++it) {
        res += (it->first.size() + it->second.size());
    }
    return res;
}

And the assembly code (for both cases) is:

8048d70:       56                      push   %esi
8048d71:       53                      push   %ebx
8048d72:       31 db                   xor    %ebx,%ebx
8048d74:       83 ec 14                sub    $0x14,%esp
8048d77:       8b 74 24 20             mov    0x20(%esp),%esi
8048d7b:       8b 46 0c                mov    0xc(%esi),%eax
8048d7e:       83 c6 04                add    $0x4,%esi
8048d81:       39 f0                   cmp    %esi,%eax
8048d83:       74 1b                   je     8048da0 
8048d85:       8d 76 00                lea    0x0(%esi),%esi
8048d88:       8b 50 10                mov    0x10(%eax),%edx
8048d8b:       03 5a f4                add    -0xc(%edx),%ebx
8048d8e:       8b 50 14                mov    0x14(%eax),%edx
8048d91:       03 5a f4                add    -0xc(%edx),%ebx
8048d94:       89 04 24                mov    %eax,(%esp)
8048d97:       e8 f4 fb ff ff          call   8048990 <std::_Rb_tree_increment(std::_Rb_tree_node_base const*)@plt>
8048d9c:       39 c6                   cmp    %eax,%esi
8048d9e:       75 e8                   jne    8048d88 
8048da0:       83 c4 14                add    $0x14,%esp
8048da3:       89 d8                   mov    %ebx,%eax
8048da5:       5b                      pop    %ebx
8048da6:       5e                      pop    %esi
8048da7:       c3                      ret    
betabandido
  • 18,946
  • 11
  • 62
  • 76
  • 12
    The compiler optimized `v.end()` in this case, it may not always be able to do so (for other containers). – Motti May 30 '12 at 18:19
  • 1
    Like @Motti says, this is not proof. – ergosys May 30 '12 at 18:20
  • 1
    the OP also included the option to cache `v.end()` and the assembly code would still look the same. If you are more comfortable with that... – betabandido May 30 '12 at 18:23
  • Nice, thank you! Am I right with the assumpttion that they (teh codez) are the same because `std::vector` invalidates the upon insertion (reallocation, actually)? And that for other containers (e.g. `std::list`) the `end()` gets cached? – TeaOverflow May 30 '12 at 18:27
  • 3
    @Evgeni: no you are wrong. It depends on the complexity of `end`, the transparency of the loop body (if you pass a reference to the container to an opaque functions all bets are off) and whether the optimizer is good enough to deduce it or not. Despite what most books do, it is *unconditionally* better to cache the `end()` call for for-each equivalents. It also documents that `end()` won't bulge for future readers. – Matthieu M. May 30 '12 at 18:34
  • 2
    note that using std::for_each also results in the same compiled code. – Willem Hengeveld May 30 '12 at 18:50
27

Range-for is as fast as possible since it caches the end iterator[citation provided], uses pre-increment and only dereferences the iterator once.

so if you tend to write:

for(iterator i = cont.begin(); i != cont.end(); i++) { /**/ }

Then, yes, range-for may be slightly faster, since it's also easier to write there's no reason not to use it (when appropriate).

N.B. I said it's as fast as possible, it isn't however faster than possible. You can achieve the exact same performance if you write your manual loops carefully.

Motti
  • 110,860
  • 49
  • 189
  • 262
5

No. It is same as the old for loop with iterators. After all, the range-based for works with iterators internally. The compiler just produces equivalent code for both.

Nawaz
  • 353,942
  • 115
  • 666
  • 851
  • I knew it uses iterators, but what liked to know if the compiler produces the best possible version of the loop with iterators... – TeaOverflow May 30 '12 at 18:12
5

It's possibly faster, in rare cases. Since you can't name the iterator, an optimizer can more easily prove that your loop cannot modify the iterator. This affects e.g. loop unrolling optimizations.

MSalters
  • 173,980
  • 10
  • 155
  • 350