
I've got a vector of 10000 random numbers (mod 100) and I'd like to count how many pairs of those numbers sum to 100. I've written the following:

```cpp
auto noPairsSumTo100 = 0;
const auto itEnd = end(myNums);
for (auto it1 = begin(myNums); it1 != itEnd; ++it1) {
  for (auto it2 = it1; it2 != itEnd; ++it2) {
    if (*it1 + *it2 == 100) {
      noPairsSumTo100++;
    }
  }
}
```

On my machine this takes about 21.6 seconds to run in debug mode. If I set _ITERATOR_DEBUG_LEVEL=0 (which sets both _SECURE_SCL and _HAS_ITERATOR_DEBUGGING to 0) the execution time is reduced to ~9.5 seconds. Replacing the != comparisons with < reduces the time further to ~8.5 seconds.
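(For anyone trying to reproduce this: as far as I know the macro has to be visible before the first standard header is included, or be passed on the compiler command line, and it has to match across every translation unit you link. A minimal sketch; the macro is MSVC-specific and other standard libraries simply ignore it:)

```cpp
// _ITERATOR_DEBUG_LEVEL must be defined before the first standard header
// (or passed as /D_ITERATOR_DEBUG_LEVEL=0 on the cl command line), and it
// must be consistent across all translation units that are linked together.
// On non-MSVC standard libraries the macro has no effect.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>
```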

If I implement the same algorithm by indexing the vectors like this:

```cpp
auto noPairsSumTo100 = 0;
for (auto index1 = 0; index1 < noTerms; ++index1) {  // noTerms == myNums.size()
  for (auto index2 = index1; index2 < noTerms; ++index2) {
    if (myNums[index1] + myNums[index2] == 100) {
      noPairsSumTo100++;
    }
  }
}
```

It takes about 2.1 seconds to run in debug mode. I think this is as close as I can make the algorithms aside from iterator usage. My question is, what makes the first implementation take ~4 times longer than the second?

Note, both versions of the algorithm take about 34 milliseconds to run in release mode, so the difference is optimised out.

eoinmullan
  • Well, it is a feature to remind you to avoid O(n^2) algorithms for large values of n. If you want to find out what checked iterators do then just have a look. You have the source code, you can step through it with the debugger. – Hans Passant Nov 02 '13 at 01:19
  • possible duplicate of [Visual Studio debug iterators](http://stackoverflow.com/questions/6103314/visual-studio-debug-iterators) – kfsone Nov 02 '13 at 05:05
  • @kfsone that question and Stephan's blog post explain the difference between _SECURE_SCL and _HAS_ITERATOR_DEBUGGING. I'm using _ITERATOR_DEBUG_LEVEL, which has been introduced since then and supersedes them both. My question is why, in the above example, do the #defines make such a big difference, ~2200x. – eoinmullan Nov 02 '13 at 10:44
  • @HansPassant I've walked through the source and I can see bounds checking and checking if the parent vector still exists. I guess I didn't expect this to make such a drastic difference. The fact that it's an O(n^2) algorithm means a small change in n makes a big difference to the total execution time, but it doesn't explain the ~2200x difference between debug and release. It's still an O(n^2) algorithm in release mode. Or am I missing something there? – eoinmullan Nov 02 '13 at 10:57
  • Iterators were crafted to collapse to a simple pointer after the compiler's optimizer is done with them, needing a fraction of a nanosecond to increment, compare and dereference. This of course does not happen in the Debug build, you pay for the many method calls and iterator checks. Use small data sets to debug code. – Hans Passant Nov 02 '13 at 11:05
  • On a completely different topic, I can think of an `O(N)` (as opposed to `O(N^2)`) way to implement this task. Can you? – Joe Z Nov 02 '13 at 18:10
  • Regarding the slowdown: What happens if you write `++it1` and `++it2` instead of `it1++` and `it2++`? Post-increment implies a temporary object creation, and in debug mode it's entirely possible none of that gets optimized away. So, you end up with several million temp objects created and destroyed... – Joe Z Nov 02 '13 at 18:14
  • There is a table in [the documentation for `_ITERATOR_DEBUG_LEVEL`](http://msdn.microsoft.com/en-us/library/vstudio/hh697468.aspx) that explains exactly how the values of that macro map to `_SECURE_SCL` and `_HAS_ITERATOR_DEBUGGING`. – James McNellis Nov 02 '13 at 19:24
  • The bounds checks tend to produce assembly where the correct usage results in a branch miss (that is `if index < size() then forwardBranch`), plus you're using post-increment rather than pre-increment http://stackoverflow.com/questions/24901/is-there-a-performance-difference-between-i-and-i-in-c. Lastly, you're using two non-constant loop constraints. BAM. (Remember, debug build) – kfsone Nov 03 '13 at 00:41
  • Using `++it1` and `++it2` along with constant loop constraints gives a ~3.5x speedup, I should have done this originally as they have nothing to do with debug iterators. Using _ITERATOR_DEBUG_LEVEL=0 gives a further ~2.3x speedup. I've updated the question. – eoinmullan Nov 05 '13 at 21:59
  • @kfsone I've updated the question. Would that branch miss apply equally to both algorithms? – eoinmullan Nov 05 '13 at 22:02
  • On x86 or x64 architecture, the predictor is optimized to favor short, backwards branches that you would expect in a loop. The branch conditions in the debug iterators (`if (index >= size()) Herp();`) tend to get compiled as the assembly equivalent of `if index is valid then skip a few instructions otherwise Herp()`, which results in a branch miss in the nominal case. – kfsone Nov 06 '13 at 04:41
  • @kfsone: It may be a branch miss the first time, but the branch predictor shouldn't make that mistake more than once. – Billy ONeal Nov 19 '13 at 23:43
  • @BillyONeal branch prediction on Intel architecture is dependent on the direction. http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts `A forward branch defaults to not taken; a backward branch defaults to taken`, and both GCC and MSVC tend to take `if(condition)action;` as a likely case. E.g. `if(index > size)` will compile as `jbe`. So writing `if(exceptionalCase) return;` introduces a frequent branch miss. – kfsone Nov 20 '13 at 00:49
  • @kfsone: forward branch *defaults* to not taken. Not *is always assumed to be* taken. Branch predictors are smarter than that. – Billy ONeal Nov 20 '13 at 00:55
  • Somewhat true, but if it's your nominal case, why make it so you get a branch miss every time your BTB coverage expires, or so that inlinings of your code suffer the problem every time they get hit? Or so it might encourage the compiler to emit a [bad] static hint (MSVC still has no equivalent of __builtin_expect). If you need to be optimizing branches, do it right. And we *were* talking about optimization tweaks here. – kfsone Nov 20 '13 at 01:20
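The O(N) approach Joe Z hints at in the comments can be sketched with a value histogram. This is one possible reading of the hint, not anything the commenter posted; it matches the question's loops, which start the inner iterator at the outer one, so (i, i) self-pairs count too:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// O(N) pair count for values in [0, 100). Counts pairs (i, j) with i <= j
// whose values sum to 100, matching the question's nested loops.
std::int64_t pairsSummingTo100(const std::vector<int>& nums) {
    std::array<std::int64_t, 100> count{};    // histogram of values 0..99
    for (int v : nums) ++count[v];

    std::int64_t total = 0;
    for (int v = 1; v < 50; ++v)              // v pairs with 100 - v
        total += count[v] * count[100 - v];
    total += count[50] * (count[50] + 1) / 2; // 50 + 50, including i == j
    return total;
}
```

Note that 0 pairs with nothing (its complement, 100, is out of range mod 100), and the 50s need the triangular-number formula because both elements come from the same bucket.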

2 Answers


Bounds checking aside, debug builds of STL code produce an insane number of function calls. An innocent-looking line like:

```cpp
if (a.empty())
```

can produce as many as eight (or more) function calls. Some (all?) STL implementations are not optimized for debug builds at all.

A common performance issue with the STL is that developers assume function inlining always works. It doesn't. If the chain of inline functions gets too deep, the innermost ones do not get inlined, and you take a massive performance hit from function-call overhead alone. This is very common when you have containers of containers:

```cpp
map<string, map<int, string>>
```

Operations on the outer map can cause functions marked inline to be emitted as ordinary, non-inlined calls.
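A minimal sketch of how you could measure this yourself (not from the original post): the same O(N^2) count written with checked iterators and with raw pointers. In an MSVC debug build the iterator version goes through the debug-iterator member functions on every `++`, `*`, and `!=`, while the pointer version compiles to roughly the same few instructions in debug and release:

```cpp
#include <cassert>
#include <vector>

// Iterator version: each ++/*/!= is a (checked) member-function call
// in an MSVC debug build.
long countIter(const std::vector<int>& nums) {
    long n = 0;
    const auto itEnd = nums.end();
    for (auto it1 = nums.begin(); it1 != itEnd; ++it1)
        for (auto it2 = it1; it2 != itEnd; ++it2)
            if (*it1 + *it2 == 100) ++n;
    return n;
}

// Raw-pointer version: no function calls, no debug checks, even unoptimized.
long countPtr(const std::vector<int>& nums) {
    long n = 0;
    const int* const pEnd = nums.data() + nums.size();
    for (const int* p1 = nums.data(); p1 != pEnd; ++p1)
        for (const int* p2 = p1; p2 != pEnd; ++p2)
            if (*p1 + *p2 == 100) ++n;
    return n;
}
```

Timing the two under the debug runtime (e.g. with `<chrono>`) isolates how much of the gap is iterator machinery rather than the algorithm itself.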

egur
  • Yes, this seems to be a considerable factor. There are two function calls per `operator<`, two per `operator++`, and two per `operator*`. That's 8 function calls per iteration of the inner loop. I've done a benchmark to see the difference 8 function calls make in an otherwise empty loop, and the extra time is close to what I'm seeing in my original question. Good answer, thanks. – eoinmullan Nov 20 '13 at 23:38

How about 21s down to less than 3s with just one change.

In your project properties, set C/C++ -> Code Generation -> Basic Runtime Checks to Default. This removes the /RTC1 checks (stack-frame and uninitialized-variable checking) that the Debug configuration enables by default.

Compare the generated code or read about it in this blog post.

(This is still relevant for VS 2013/2015.)

Nick Westgate