7

To increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.

I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v:

  1. Unoptimized loop with index:

    uint64_t sum = 0;
    for (unsigned int i = 0; i < v.size(); i++)
        sum += v[i];
    
  2. Unoptimized loop with iterator:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it;
    for (it = v.begin(); it != v.end(); it++)
        sum += *it;
    
  3. Cached std::vector::end iterator:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it, end(v.end());
    for (it = v.begin(); it != end; it++)
        sum += *it;
    
  4. Pre-increment iterators:

    uint64_t sum = 0;
    std::vector<uint32_t>::const_iterator it, end(v.end());
    for (it = v.begin(); it != end; ++it)
        sum += *it;
    
  5. Range-based loop:

    uint64_t sum = 0;
    for (auto const &x : v)
        sum += x;
    

There are also other ways to build a loop in C++, for instance by using std::for_each, BOOST_FOREACH, etc.
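For reference, a std::for_each version of the same sum might look like this (a sketch using a C++11 lambda; the function name is mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t sum_for_each(const std::vector<uint32_t>& v)
{
    uint64_t sum = 0;
    // std::for_each applies the lambda to every element;
    // the running total is captured by reference.
    std::for_each(v.begin(), v.end(), [&sum](uint32_t x) { sum += x; });
    return sum;
}
```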

In your opinion, which is the best approach to increase performance, and why?

Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?

vdenotaris
  • 13,297
  • 26
  • 81
  • 132
  • `Which is the best solution to save our performances? And why?` You tell me. _Benchmark it._ – Lightness Races in Orbit Aug 13 '13 at 15:48
  • Loop unrolling is generally done by the compiler. Also, always use pre-increment operator (in case similar to this one). – Xaqq Aug 13 '13 at 15:48
  • 5
    Good compilers with decent optimization should yield the same assembly for all of these. If your compiler supports a `restrict` keyword, you may get slightly better performance by grabbing vector's array and summing in a for-loop (a more C approach). You might not. Look at the assembly under optimization and see if there's any difference at all. – sfstewman Aug 13 '13 at 15:48
  • 2
    Premature optimization is the root of all evil. – Felix Glas Aug 13 '13 at 15:49
  • @Snps: ...yet, there is no excuse for patently inefficient code. – Lightness Races in Orbit Aug 13 '13 at 15:49
  • @LightnessRacesinOrbit Well, those 4 cases are all decent way to iterate over a `vector`. – Xaqq Aug 13 '13 at 15:50
  • 7
    @Snps How do you know it's *premature*? The OP may very well have identified a critical bottleneck. Truism that don't answer the question are not useful. – sfstewman Aug 13 '13 at 15:50
  • @Xaqq: Yep, in this case. – Lightness Races in Orbit Aug 13 '13 at 15:50
  • 2
    @sfstewman "during the development phase" -- likely premature. – Xaqq Aug 13 '13 at 15:51
  • 1
    @Xaqq Depends on the kind of application. If you're writing simulation software or a game, development typically has performance criteria. Optimization at the end of development is normal. – sfstewman Aug 13 '13 at 15:52
  • @Snps True, but he clearly said he has a performance problem. – James Kanze Aug 13 '13 at 15:54
  • @LightnessRacesinOrbit There's no "patently inefficient code" in his examples. – James Kanze Aug 13 '13 at 15:54
  • @sfstewman Agree, but those are micro-optimisation (in my opinion), and its likely that optimizing algorithm (*if possible*) would lead to a higher gain. However, what you said is true :) – Xaqq Aug 13 '13 at 15:55
  • @JamesKanze: I didn't say that there was. – Lightness Races in Orbit Aug 13 '13 at 15:56
  • @Xaqq: When else do you optimise? At Departures as you're waving off your software package, moments before it takes its seat on the Boeing 747, travelling to its new home with your customer? – Lightness Races in Orbit Aug 13 '13 at 15:58
  • @LightnessRacesinOrbit When battery of the 787 ignites, clearly. :) – sfstewman Aug 13 '13 at 15:59
  • @LightnessRacesinOrbit When the software already works, and you're in need of performance improvement. The OP may be in a *very* specific case, but it's (again in my opinion / experience -- admittedly limited) unlikely that changing the way you iterate over `vector` will give you any noticeable gain. – Xaqq Aug 13 '13 at 16:00
  • 1
    @Xaqq: No, I'm asking, if not "during development", then when else? By definition, if you're optimising your code, you're developing it. – Lightness Races in Orbit Aug 13 '13 at 16:00
  • @LightnessRacesinOrbit Or at least you haven't finished development. (Unless you're in maintenance mode, and have a bug report: xxx is too slow.) – James Kanze Aug 13 '13 at 16:24
  • @LightnessRacesinOrbit Quality assurance phase? But I guess we could call it dev phase. Packaging is also developing software right, because you're improving it (or the way you'll distribute it). Development: *An event constituting a new stage in a changing situation*. Anyway, I'm not pro at this, and I'm pretty sure you understood what I meant before. – Xaqq Aug 13 '13 at 16:45
  • @Xaqq: Understanding what you hopefully meant does not preclude my responsibility to ensure that you really meant it. :) – Lightness Races in Orbit Aug 13 '13 at 18:18
  • Why do you keep removing my corrections? Your code doesn't even compile without them. – Lightness Races in Orbit Aug 13 '13 at 18:19
  • 1
    @LightnessRacesinOrbit Understood :) – Xaqq Aug 13 '13 at 18:21

5 Answers

8

There's no hard and fast rule, since it depends on the implementation. If the measurements I did some years back are typical, however, about the only thing which makes a difference is caching the end iterator. Pre- or post-increment makes no difference, regardless of the container and iterator type.

At the time, I didn't measure indexing (because I was comparing iterators of different types of container as well, and not all support indexing). But I would guess that if you use indexes, you should cache the result of v.size() as well.

Of course, these measurements were for one compiler (g++) on one system, with specific hardware. The only way you can know for your environment is to measure yourself.

Re your note: are you sure you have full optimization turned on? My measurements showed no difference between 3 and 4, and I doubt that compilers optimize less well today.

It's very important for the optimizations here that the functions are actually inlined. If they're not, post-incrementation does require some extra copying, and typically will require an extra function call (to the copy constructor of the iterator) as well. Once the functions are inlined, however, the compiler can easily see that all of this is unnecessary, and (at least when I tried it) generate exactly the same code in both cases. (I'd use pre-incrementation anyway. Not because it makes a difference, but because if you don't, some idiots will come along claiming it will, despite your measurements. Or maybe they're not idiots, but are just using a particularly stupid compiler.)

To tell the truth, when I did the measurements, I was surprised that caching the end iterator made a difference, even for vector, whereas there was no difference between pre- and post-incrementation, even for a reverse iterator into a map. After all, end() was inlined as well; in fact, every single function used in my tests was inlined.

As to unrolling the loops: I'd probably do something like this:

uint64_t sum = 0;
std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ( (end - current) % 4 ) {
case 3:
    sum += *current ++;     // fall through
case 2:
    sum += *current ++;     // fall through
case 1:
    sum += *current ++;     // fall through
case 0:
    break;                  // a label must precede a statement
}
while ( current != end ) {
    sum += current[0] + current[1] + current[2] + current[3];
    current += 4;
}

(This is a factor of 4. You can easily increase it if necessary.)

James Kanze
  • 150,581
  • 18
  • 184
  • 329
  • 1
    What level of optimization did you use? The `size` method of `std::vector` should inline at `-O2` or higher, and recent versions of GCC may be able to hoist this out of the loop (effectively caching it), as well. – sfstewman Aug 13 '13 at 15:58
  • @sfstewman The maximum, according to the compiler documentation. I didn't do any tests indexing, so I didn't use `size()`, but I would have expected the same considerations to apply to `end()`. At least back then, though, explicitly caching `end()` did make things faster. – James Kanze Aug 13 '13 at 16:14
  • Very nice unrolling technique (+1), but I have a question, Are unrolling techniques (as you showed) effective in modern C++ compilers? – masoud Aug 13 '13 at 16:21
  • @MM. I don't know. I've never needed them recently. I think that many modern compilers can find them themselves. (But then, I would have sworn that compilers back when I did my measures would automatically cache `end()`, and they didn't.) – James Kanze Aug 13 '13 at 16:26
  • It's actually better to have 4 separate sum variables for the entire loop. Then combine them only at the end. This breaks the nasty dependency chain in the middle. – Mysticial Aug 13 '13 at 16:42
  • @Mysticial Excellent idea. (When I did this, dependencies weren't such an issue.) Of course, this also depends on the architecture. On an x86, you don't have many registers, and it's possible that adding additional summation variables will cause the compiler to spill them to memory, instead of keeping them all in registers during the loop. You need to measure. But it's something else to try. – James Kanze Aug 13 '13 at 16:53
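A sketch of the four-accumulator variant @Mysticial suggests, under the same unrolling scheme as the answer above (the function name is mine, and whether it pays off depends on how many registers your target has):

```cpp
#include <cstdint>
#include <vector>

uint64_t sum_unrolled4(const std::vector<uint32_t>& v)
{
    std::vector<uint32_t>::const_iterator current = v.begin();
    std::vector<uint32_t>::const_iterator end = v.end();
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    // Handle the leftover elements first, as in the answer above.
    switch ((end - current) % 4) {
    case 3: s0 += *current++;   // fall through
    case 2: s1 += *current++;   // fall through
    case 1: s2 += *current++;   // fall through
    case 0: break;
    }
    // Four independent accumulators: the additions within one
    // iteration don't depend on each other, so the CPU can
    // execute them in parallel instead of serializing on one sum.
    while (current != end) {
        s0 += current[0];
        s1 += current[1];
        s2 += current[2];
        s3 += current[3];
        current += 4;
    }
    return s0 + s1 + s2 + s3;   // combine only at the end
}
```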
3

I'm going on the assumption that you are well aware of the evils of premature micro-optimization, and that you have identified hotspots in your code by profiling and all the rest. I'm also going on the assumption that you're only concerned about performance with respect to speed. That is, you don't care deeply about the size of the resulting code or memory use.

The code snippets you have provided will yield largely the same results, with the exception of the cached end() iterator. Aside from caching and inlining as much as you can, there is not much you can do to tweak the structure of the loops above to realize significant gains in performance.

Writing performant code in critical paths relies first and foremost on selecting the best algorithm for the job. If you have a performance problem, look first and hard at the algorithm. The compiler will generally do a much better job at micro-optimizing the code you wrote than you could ever hope to.

All that being said, there are a few things you can do to give your compiler a little help.

  • Cache everything you can
  • Keep small allocations to a minimum, especially within a loop
  • Make as many things const as you can. This gives the compiler additional opportunities to micro-optimize.
  • Learn your toolchain well and leverage that knowledge
  • Learn your architecture well and leverage that knowledge
  • Learn to read assembly code and examine the assembly output from your compiler

Learning your toolchain and architecture is going to yield the most benefits. For example, GCC has many options you can enable to increase performance, including loop unrolling. See here. When iterating over datasets, it is often beneficial to keep each item aligned to the size of a cache line. On modern architectures this often means 64 bytes, but check your architecture.

Here is an excellent guide to writing performant C++ in an Intel environment.

Once you have learned your architecture and toolchain, you might find that the algorithm you originally selected is not optimal in your real world. Be open to change in the face of new data.
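As one small illustration of the cache-line point, C++11's alignas lets you request that alignment explicitly (the 64-byte figure is an assumption; verify it for your target architecture):

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; check your architecture.
constexpr std::size_t kCacheLine = 64;

// Each item occupies its own cache line, so two threads updating
// adjacent items never contend for the same line (no false sharing).
struct alignas(kCacheLine) Counter {
    uint64_t value;
};
```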

John Dibling
  • 99,718
  • 31
  • 186
  • 324
2

It's very likely that modern compilers will produce the same assembly for the approaches you give above. You should look at the actual assembly (after enabling optimizations) to see.

When you're down to worrying about the speed of your loops, you should really think about whether your algorithm is truly optimal. If you're convinced it is, then you need to think about (and make use of) the underlying implementation of the data structures. std::vector uses an array underneath, and, depending on the compiler and the other code in the function, pointer aliasing may prevent the compiler from fully optimizing your code.

There's a fair amount of information out there on pointer aliasing (including What is the strict aliasing rule?), but Mike Acton has some wonderful information about pointer aliasing.

The restrict keyword (see What does the restrict keyword mean in C++? or, again, Mike Acton), available through compiler extensions for many years and codified in C99 (currently only available as a compiler extension in C++), is meant to deal with this. The way to use this in your code is far more C-like, but may allow the compiler to better optimize your loop, at least for the examples you've given:

uint64_t sum = 0;
// 'restrict' itself is C99; in C++ use the compiler extension
// __restrict (supported by GCC, Clang, and MSVC).
const uint32_t *__restrict velt = &v[0];
const uint32_t *__restrict vend = velt + v.size();
while(velt < vend) {
  sum += *velt;
  velt++;
}

However, to see whether this provides a difference, you really need to profile the different approaches for your actual, real-life problem, and possibly look at the underlying assembly produced. If you're summing simple data types, this may help you. If you're doing anything more complicated, including calling a function that cannot be inlined in the loop, it's unlikely to make any difference at all.

sfstewman
  • 5,589
  • 1
  • 19
  • 26
2

If you're using clang, then pass it these flags:

  -Rpass-missed=loop-vectorize
  -Rpass-analysis=loop-vectorize

In Visual C++ add this to the build:

  /Qvec-report:2

These flags will tell you if a loop fails to vectorise (and give you an often cryptic message explaining why).

In general though, prefer options 4 and 5 (or std::for_each). Whilst clang and gcc will typically do a decent job in most cases, Visual C++ tends to err on the side of caution sadly. If the scope of the variable is unknown (e.g. a reference or pointer passed into a function, or this pointer), then vectorisation often fails (containers in the local scope will almost always vectorise).

#include <vector>
#include <cmath>

// fails to vectorise in Visual C++ with /O2
void func1(std::vector<float>& v)
{
  for(size_t i = 0; i < v.size(); ++i)
  {
    v[i] = std::sqrt(v[i]);
  }
}

// this will vectorise with sqrtps
void func2(std::vector<float>& v)
{
  for(std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
  {
    *it = std::sqrt(*it);
  }
}

Clang and gcc aren't immune to these issues either. If you always take a copy of begin/end, then it cannot be a problem.

Here's another classic that affects many compilers sadly (clang 3.5.0 fails this trivial test, but it's fixed in clang 4.0). It crops up a LOT!

struct Foo
{
  void func3();
  void func4();
  std::vector<float> v;
  float f;
};

// will not vectorise
void Foo::func3()
{
  // this->v.end() !!
  for(std::vector<float>::iterator it = v.begin(); it != v.end(); ++it)
  {
    *it *= f;  // this->f !!
  }
}

void Foo::func4()
{
  // you need to take a local copy of v.end(), and 'f'.
  const float temp = f;
  for(std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
  {
    *it *= temp;
  }
}

In the end, if it's something you care about, use the vectorisation reports from the compiler to fix up your code. As mentioned above, this is basically an issue of pointer aliasing. You can use the restrict keyword to help fix some of these issues (but I've found that applying restrict to 'this' is often not that useful).

robthebloke
  • 9,331
  • 9
  • 12
0

Use range-based for by default, as it gives the compiler the most direct information to optimize (the compiler knows it can cache the end iterator, for example). Then profile, and only optimize further if you identify a significant bottleneck. There will be very few real-world situations where these different loop variants make a meaningful performance difference. Compilers are pretty good at loop optimization, and it is far more likely that you should focus your optimization effort elsewhere (like choosing a better algorithm or optimizing the loop body).
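In the same spirit of letting the standard library convey intent, the original sum can also be written with std::accumulate (a sketch; note that the type of the initial value sets the accumulator type):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

uint64_t sum_accumulate(const std::vector<uint32_t>& v)
{
    // uint64_t{0} makes the running total 64-bit, avoiding
    // 32-bit overflow on large inputs.
    return std::accumulate(v.begin(), v.end(), uint64_t{0});
}
```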

mattnewport
  • 13,728
  • 2
  • 35
  • 39