
I've found this snippet while looking at <algorithm> implementations:

```cpp
/**
 *  @if maint
 *  This is an overload used by find() for the RAI case.
 *  @endif
 */
template<typename _RandomAccessIterator, typename _Tp>
  _RandomAccessIterator
  find(_RandomAccessIterator __first, _RandomAccessIterator __last,
       const _Tp& __val, random_access_iterator_tag)
  {
    typename iterator_traits<_RandomAccessIterator>::difference_type
      __trip_count = (__last - __first) >> 2;

    for (; __trip_count > 0; --__trip_count)
      {
        if (*__first == __val)
          return __first;
        ++__first;

        if (*__first == __val)
          return __first;
        ++__first;

        if (*__first == __val)
          return __first;
        ++__first;

        if (*__first == __val)
          return __first;
        ++__first;
      }

    switch (__last - __first)
      {
      case 3:
        if (*__first == __val)
          return __first;
        ++__first;
      case 2:
        if (*__first == __val)
          return __first;
        ++__first;
      case 1:
        if (*__first == __val)
          return __first;
        ++__first;
      case 0:
      default:
        return __last;
      }
  }
```

From my understanding, the only "trick" here is to perform ⌊distance / 4⌋ iterations, each containing four `if (...) return; ++` blocks, and then handle the remaining `distance % 4` elements inside a fall-through `switch`. So, as expected, exactly the same number of comparisons and increments is performed, and the theoretical complexity is the same. How is this a better optimization than the trivial input-iterator implementation? Is it a micro-optimization to lower the number of loop iterations, or is there something smarter I don't get?

FredericS
  • I'd call it a prime example of why you should use the standard algorithms instead of rolling your own. That's not to say I know why it's written this way, but CPUs are complex and this likely appeases a specific aspect, much like how a branch-prediction failure can make an algorithm of the same complexity fail miserably. – ghostofstandardspast Jun 18 '14 at 22:27
  • It's manually unrolling the loop. Presumably, the implementer did some testing to figure out that this was beneficial compared to having a single `for` statement that iterates over the whole range. – Praetorian Jun 18 '14 at 22:30
  • 1
    Google "Duff's device". – Jon Jun 18 '14 at 22:30

1 Answer


This technique is called loop unrolling: it reduces the cost of checking the loop condition at every iteration, trading that against increased binary size (and instruction-cache space).
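For contrast, here is a minimal sketch of the trivial, non-unrolled version the question mentions (the name `find_simple` is hypothetical): it pays one `first != last` test per element, which is exactly the cost the unrolled loop amortizes down to one `__trip_count > 0` test per four elements.

```cpp
#include <iterator>

// Sketch of the trivial input-iterator version of find():
// one loop-condition test plus one comparison per element.
template <typename InputIterator, typename T>
InputIterator find_simple(InputIterator first, InputIterator last, const T& val)
{
    for (; first != last; ++first)  // end-of-range test every iteration
        if (*first == val)
            return first;
    return last;
}
```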

The speedup also depends on your architecture, but a superscalar CPU can usually take advantage of unrolling: the independent operations in the unrolled body break up long dependency chains that would otherwise stall the pipeline, for example while waiting on a cache miss.
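A classic, hypothetical illustration of this effect (not from the libstdc++ source) is summing with four independent accumulators: each partial sum depends only on itself, so a superscalar CPU can execute the four additions in parallel instead of serializing them through a single accumulator's dependency chain.

```cpp
#include <cstddef>

// Unrolled sum with four independent accumulators: s0..s3 have no
// dependencies on each other, exposing instruction-level parallelism.
long sum_unrolled(const int* data, std::size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // main unrolled loop
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i)             // handle the remaining n % 4 elements
        s0 += data[i];
    return s0 + s1 + s2 + s3;
}
```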

Although the theoretical complexity is the same (if you consider comparisons the dominant operation), the algorithm can perform significantly faster on such a CPU. As long as the standard's complexity requirements are respected, any STL implementation is free to provide its own version.

Related answers: When, if ever, is loop unrolling still useful?

Marco A.
  • Indeed, this implementation replaces `n` end-of-loop tests with two subtractions and `n / 4` is-positive tests on an integer. – FredericS Jun 18 '14 at 22:41
  • Aggressive loop unrolling is also a well-known optimization on massively parallel architectures (e.g. CUDA), but other factors come into play there, like register pressure and the different types of caches. – Marco A. Jun 18 '14 at 22:42