
Considering the following range-based for loop in C++11:

for ( T k : j )
{
  ...
}

Are there any g++ or clang++ optimization flags that can speed up the compiled code?

I'm not talking about for loops in general; I'm only considering this new C++11 construct.

user2384250
  • As range-based for loops are syntactic sugar, maybe you can try unrolling the loops. See this [question](http://stackoverflow.com/questions/10678419/c-2011-range-based-loop-unrolling) – teh internets is made of catz May 22 '13 at 09:47
  • How to speed up that loop would depend A LOT on what the `...` bit is. – Mats Petersson May 22 '13 at 09:49
  • what type is `T`? Maybe `for ( T& k : j )` could speed things up. – Arne Mertz May 22 '13 at 10:22
  • @MatsPetersson can you explain and expand a little bit more on what you are saying? Can you categorize these scenarios? – user2384250 May 22 '13 at 10:34
  • @tehinternetsismadeofcatz I'm not sure about that, the fact that it's just sugar; I think the standard sees it as a new construct, and if some compiler offers a particular interpretation I don't think that changes the standard. But it's always nice to see the real implementation of things. – user2384250 May 22 '13 at 10:39
  • @ArneMertz the simplest thing in the world and I totally forgot about that :D, thanks, but I would still like to focus on how to speed up this for loop. – user2384250 May 22 '13 at 10:40
  • @user2384250 If you have a look at the gcc page I linked in my answer, you will find the "scenarios" that Mats is talking about :) – Morwenn May 22 '13 at 17:15
  • http://gcc.gnu.org/projects/cxx0x.html – kfsone Jun 01 '13 at 03:31

2 Answers


The GCC documentation about auto-vectorization doesn't mention anything about the range-based for loop. However, the range-based for loop itself boils down to:

{
    auto && __range = range_expression ;
    for (auto __begin = begin_expr,
                __end = end_expr;
            __begin != __end; ++__begin) {
        range_declaration = *__begin;
        loop_statement
    }
}

So, technically speaking, any flag that helps to auto-vectorize this kind of regular for loop should auto-vectorize an equivalent range-based for loop just as well. I really do think compilers simply translate range-based for loops into regular for loops, then let auto-vectorization do its job on those old-style loops. This means there is no need for a dedicated flag to tell your compiler to auto-vectorize range-based for loops in any scenario.
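
For what it's worth, here is a minimal sketch (the function names and the file name are made up for illustration) of two loops the vectorizer should treat identically; on GCC, -O3 enables auto-vectorization, and recent releases can report on it with -fopt-info-vec:

// Compile with something like:
//   g++ -std=c++11 -O3 -march=native -fopt-info-vec range_for.cpp
#include <vector>
#include <cstddef>

// Classic index-based loop.
void scale_classic(std::vector<float>& v, float factor)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= factor;
}

// Range-based loop: the compiler first expands it into the iterator form
// shown above, then the usual optimization passes run on the result.
void scale_range(std::vector<float>& v, float factor)
{
    for (float& x : v)   // note the reference, so elements are modified in place
        x *= factor;
}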


Since GCC's implementation was asked for, here is the relevant comment in the source code describing what is actually done for the range-based for loop (you can check the implementation file parser.c if you want to have a look at the code):

/* Converts a range-based for-statement into a normal
   for-statement, as per the definition.

      for (RANGE_DECL : RANGE_EXPR)
        BLOCK

   should be equivalent to:

      {
        auto &&__range = RANGE_EXPR;
        for (auto __begin = BEGIN_EXPR, __end = END_EXPR;
             __begin != __end;
             ++__begin)
          {
            RANGE_DECL = *__begin;
            BLOCK
          }
      }

   If RANGE_EXPR is an array:
     BEGIN_EXPR = __range
     END_EXPR = __range + ARRAY_SIZE(__range)
   Else if RANGE_EXPR has a member 'begin' or 'end':
     BEGIN_EXPR = __range.begin()
     END_EXPR = __range.end()
   Else:
     BEGIN_EXPR = begin(__range)
     END_EXPR = end(__range);

   If __range has a member 'begin' but not 'end', or vice versa, we must
   still use the second alternative (it will surely fail, however).
   When calling begin()/end() in the third alternative we must use
   argument dependent lookup, but always considering 'std' as an associated
   namespace.  */

As you can see, it does nothing more than what the standard actually describes.
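
To make those three lookup cases concrete, here is a small sketch (the custom::span type is invented purely for illustration):

#include <vector>

namespace custom {
    struct span { int* first; int* last; };
    // Free begin()/end(), found through argument-dependent lookup (third case).
    int* begin(span s) { return s.first; }
    int* end(span s)   { return s.last; }
}

int sum_all()
{
    int total = 0;

    int arr[4] = {1, 2, 3, 4};
    for (int x : arr)             // array: __range and __range + ARRAY_SIZE(__range)
        total += x;

    std::vector<int> vec = {5, 6, 7};
    for (int x : vec)             // member __range.begin()/__range.end() (second case)
        total += x;

    int raw[2] = {8, 9};
    custom::span sp{raw, raw + 2};
    for (int x : sp)              // free begin(__range)/end(__range) via ADL (third case)
        total += x;

    return total;                 // 1+2+3+4 + 5+6+7 + 8+9 = 45
}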

Morwenn
  • I don't think the standard actually means to say that range-based for loops have to look that way, I think the point was that this is their behavior. (The wording is a bit confusing: "In each case, a range-based for statement is equivalent to [...]") – Cubic May 22 '13 at 10:27

Optimizing loops is very rarely about optimizing the actual loop iteration code (`for ( T k : j )` in this case), and very much about optimizing what is IN the loop.

Now, since the body is just `...` in this case, it's impossible to say whether, for example, unrolling the loop will help, or declaring functions inline [or simply moving them so that the compiler can see them and inline them], or using auto-vectorization, or perhaps using a completely different algorithm inside the loop.

The examples in the paragraph above, in a bit more detail:

  1. Unrolling the loop - essentially doing several of the loop iterations without going back to the start of the loop. This is most helpful when the loop content is very small. There is automatic unrolling, where the compiler does the unrolling, or you can unroll the code manually by doing, say, four items in each loop iteration and then stepping four items forward in each loop variable update, or updating the iterator multiple times during the loop itself [but this of course means not using the range-based for loop]; see the sketch after this list.
  2. Inline functions - the compiler will take (usually small) functions and place them into the loop itself, rather than having the call. This saves on the time it takes for the processor to call out to another place in the code and return back. Most compilers only do this for functions that are "visible" to the compiler during compilation - so the source has to be either in the same source file, or in a header file that is included in the source file that is compiled.
  3. Auto-vectorisation - using SSE, MMX or AVX instructions to process multiple data items in one instruction (e.g. one SSE instruction can add four float values to another four floats at once). This is faster than operating on a single data item at a time (most of the time; sometimes there is no benefit because of the additional complications of combining the different data items and then sorting out what goes where when the calculation is finished).
  4. Choose different algorithm - there are often several ways to solve a particular problem. Depending on what you are trying to achieve, a for-loop [of whatever kind] may not be the right solution in the first place, or the code inside the loop could perhaps use a more clever way to calculate/rearrange/whatever-it-does to achieve the result you need.
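
To make point 1 concrete, here is a rough sketch of a manually unrolled sum (the names are made up, and the unroll factor of four is arbitrary); note that it has to fall back to an index-based loop, since the range-based form gives no way to advance more than one element per iteration:

#include <vector>
#include <cstddef>

float sum_unrolled(const std::vector<float>& v)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    const std::size_t n = v.size();
    for (; i + 4 <= n; i += 4) {   // four elements per iteration
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float total = s0 + s1 + s2 + s3;
    for (; i < n; ++i)             // leftover elements (fewer than four)
        total += v[i];
    return total;
}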

But `...` is far too vague to say which, if any, of the above solutions will improve your code.

Mats Petersson
  • In general terms I would like to read and know more about this subject; in practice I usually use this kind of construct to deal with standard containers like vector and map. I'm also not sure how to turn on MMX and SSE usage, and whether it's possible to influence some caching activity for CPU optimization. – user2384250 May 22 '13 at 11:10
  • Again, that's a very wide-ranging subject. `gcc` (and I believe MSVC) is capable of automatically translating to SSE instructions, but it's not always easy to "convince" the compiler that it's a good idea. It also depends on exactly what you are doing with the contents of your `vector` or `map`, and what the overhead vs. gain from combining multiple elements and processing them is. In the case of `map`, I think the compiler would find it hard to do anything meaningful, since elements that follow each other aren't (necessarily) in order in memory. `vector` is defined to have contiguous elements. – Mats Petersson May 22 '13 at 11:36
  • And how can I study these behaviours? What are the topics that can help me with this? I mean, there should be a topic or a group of topics about what you are saying and this kind of optimization, and I would like to start some interesting reading. – user2384250 May 22 '13 at 11:43
  • What I usually do is compile the code with different options, -O1, -O2, -O3 and -funroll-loops for example, and see what the effect is on the assembler output (-S). The optimization done by `gcc` is fairly well documented. See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – Mats Petersson May 22 '13 at 11:47