61

In the code snippets below, I'm comparing the performance of two functionally identical loops:

for (int i = 1; i < v.size()-1; ++i) {
  int a = v[i-1];
  int b = v[i];
  int c = v[i+1];

  if (a < b  &&  b < c)
    ++n;
}

and

for (int i = 1; i < v.size()-1; ++i) 
  if (v[i-1] < v[i]  &&  v[i] < v[i+1])
    ++n;

The second one runs significantly slower than the first one across a number of different C++ compilers with the optimization flag set to O2:

  • second loop is about 330% slower with Clang 3.7.0
  • second loop is about 2% slower with gcc 4.9.3
  • second loop is about 2% slower with Visual C++ 2015

I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to clutter my code with temporary variables in order to get the best performance?

Using temporary variables makes the code faster, sometimes dramatically. What is going on?

The full code I'm using is provided below:

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) {
    int a = v[i-1];
    int b = v[i];
    int c = v[i+1];

    if (a < b  &&  b < c)
      ++n;
  }

  return n;
}


int f1()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) 
    if (v[i-1] < v[i]  &&  v[i] < v[i+1])
      ++n;

  return n;
}


int main()
{
  auto benchmark = [](int (*f)()) {
    const int N = 100;

    volatile long long result = 0;
    vector<long long>  timings(N);

    for (int i = 0; i < N; ++i) {
      auto t0 = high_resolution_clock::now(); 
      result += f();
      auto t1 = high_resolution_clock::now(); 

      timings[i] = duration_cast<nanoseconds>(t1-t0).count();
    }

    sort(timings.begin(), timings.end());
    cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
    cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
  };

  mt19937                    generator   (31415);   // deterministic seed
  uniform_int_distribution<> distribution(0, 1023);

  for (auto& e: v) 
    e = distribution(generator);

  benchmark(f0);
  benchmark(f1);

  cout << "\ndone\n";

  return 0;
}
Paul Jurczak
  • 3
    In second version, condition `v[i] < v[i+1]` is only checked if the first one is true. In first version, `v[i+1]` is always copied. – O'Neil Apr 05 '16 at 00:46
  • You may have caching side-effects. Do you get the same results if you reverse the call order of the two functions being tested? – paddy Apr 05 '16 at 00:48
  • @paddy The order doesn't matter - I checked. Each measurement is repeated 100 times - see linked code. – Paul Jurczak Apr 05 '16 at 00:51
  • 4
    In first version, `v[i-1]` `v[i]` `v[i+1]` are always evaluated, hence after `++i` the compiler already has the values of `v[i-1]` and `v[i]` in hand and doesn't need to read them again, only needs to read `v[i+1]`. In second version, `&&` short-circuit thwarts this optimization. (Theory) – Oktalist Apr 05 '16 at 00:52
  • The results don't seem that far off with clang: http://coliru.stacked-crooked.com/a/cd69e1bfc27b7a17 – Brandon Apr 05 '16 at 01:22
  • @Brandon I tried the code on Coliru and there is about 3x difference with Clang, similar to what I got from rextester. The difference becomes larger, if I use `array` or plain C array. – Paul Jurczak Apr 05 '16 at 01:54
  • 3
    Use `-O3`. The result at non-maximum optimization levels don't really matter. – M.M Apr 05 '16 at 02:08
  • 1
    added 2 more tests. thought you might be interested. – Richard Hodges Apr 05 '16 at 09:27

4 Answers

50

It seems like the compiler lacks knowledge about the relationship between std::vector<>::size() and the internal vector buffer size. Imagine std::vector could be our custom bugged_vector: a vector-like object with a slight bug - its ::size() can sometimes be one more than the real internal buffer size n, but only when v[n-2] >= v[n-1].

Then the two snippets have different semantics: the first one has undefined behavior, because on the last iteration it reads the element v[v.size() - 1], which in the bugged case lies past the end of the buffer. The second one doesn't: thanks to the short-circuit nature of &&, we never read v[v.size() - 1] on the last iteration.

So, if the compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduces an additional jump in the machine code.
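For concreteness, here is a hypothetical sketch (my illustration - no such type exists in the standard library) of what a bugged_vector could look like; the short-circuiting loop in f1 never reads past its buffer, while the unconditional loads in f0 would:

#include <cstddef>

// Hypothetical vector-like type with the "bug" described above: size() can
// report one element more than the buffer really holds, but only when the
// last two stored elements satisfy data[n-2] >= data[n-1].
struct bugged_vector {
  int*        data;  // buffer holding n ints
  std::size_t n;     // real number of elements in the buffer

  std::size_t size() const {
    // Over-reports by one exactly when data[n-2] >= data[n-1]; in that case
    // reading (*this)[size()-1] would run past the end of the buffer.
    return (n >= 2 && data[n-2] >= data[n-1]) ? n + 1 : n;
  }

  int operator[](std::size_t i) const { return data[i]; }
};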

By looking at the assembly output from clang, we can see that this is exactly what happens.

From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:

### f0: just the loop
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
    mov     edi, ecx
    cmp     edx, edi
    setl    r10b
    mov     ecx, dword ptr [r8 + 4*rsi + 4]
    lea     rsi, [rsi + 1]
    cmp     edi, ecx
    setl    dl
    and     dl, r10b
    movzx   edx, dl
    add     eax, edx
    cmp     rsi, r9
    mov     edx, edi
    jb      .LBB1_2

And for f1:

### f1: just the loop
.LBB2_2:                                # =>This Inner Loop Header: Depth=1
    mov     esi, r10d
    mov     r10d, dword ptr [r9 + 4*rdi]
    lea     rcx, [rdi + 1]
    cmp     esi, r10d
    jge     .LBB2_4                     # <== This is Extra Jump
    cmp     r10d, dword ptr [r9 + 4*rdi + 4]
    setl    dl
    movzx   edx, dl
    add     eax, edx
.LBB2_4:                                # %._crit_edge.3
    cmp     rcx, r8
    mov     rdi, rcx
    jb      .LBB2_2

I've pointed out the extra jump in f1. And as we (hopefully) know, conditional jumps in tight loops are bad for performance. (See the performance guides in the tag wiki for details.)

GCC and Visual C++ generate almost identical assembly for both snippets, which matches the small difference measured for them. Edit: It turns out clang does a better job optimizing the code. All three compilers can't prove that it is safe to read v[i + 1] prior to comparison in the second example (or choose not to), but only clang manages to optimize the first example with the additional information that reading v[i + 1] is either valid or UB.

A performance difference of 2% is negligible and can be explained by a different order or choice of some instructions.

Matt
  • Why is access to `v[v.size()-1]` an undefined behavior? It is a legal last element of `v`. – Paul Jurczak Apr 05 '16 at 01:07
  • 2
    @Paul Jurczak, correct, but `clang` is not aware of it. So it must assume that `v` can be a `bugged_vector` where `v[v.size()-1]` is not always valid. –  Apr 05 '16 at 01:13
  • I don't understand why you wrote: _All three compilers can't prove that it is safe to read v[i + 1] prior to comparison_ Isn't it guaranteed by C++ semantics? I don't understand `bugged_vector` concept. Doesn't `std::vector` guarantee the desired behavior here? Are there documents discussing this issue you can link to? – Paul Jurczak Apr 05 '16 at 02:02
  • 1
    @Paul Jurczak, compiler can't always decide right, it means solving halting problem, so it must make only safe assumptions. The fact that we can see `v[i + 1]` is always valid according to C++ standard doesn't mean compiler can see it. If there is no specific rule implemented which tells '`v[v.size() - 1]` is ok whenever `v` is `std::vector`', then it must be handled by generic rules. And such generic rules can't assume that `std::vector` is not a `bugged_vector`, because they must work for `bugged_vector` too. –  Apr 05 '16 at 02:14
  • 4
    @PaulJurczak: in summary: `std::vector` does guarantee correct behavior. The issue is that the compiler doesn't know that. – Mooing Duck Apr 05 '16 at 02:21
  • Interesting. "Avoid branches."-- One question: I don't quite understand the assembler code -- why is there no shortcut branch in `f0`? Does the code always simply evaluate both conditions, because it's provably free of side-effects? – Peter - Reinstate Monica Apr 05 '16 at 03:55
  • @PeterA.Schneider, I tried several compilers, and some do generate the shortcut branch in `f0`, I believe they all do in the initial intermediate language emitted as a first pass when generating code for the AST (https://www.wikiwand.com/en/Abstract_syntax_tree). The optimizers in some compilers later notice they can eliminate the branch, because as you said the code is free of side effects. – amdn Apr 05 '16 at 04:42
  • 1
    I accepted this answer, but it should be read in conjunction with an answer by _amdn_ below. It elaborates on `&&` behavior, which is the key to this problem. – Paul Jurczak Apr 05 '16 at 05:22
  • @Mooing: I don't believe the standard guarantees correct behavior -- any enterprising programmer could do horribly unsafe but legal things to modify the internals so that the class invariants aren't maintained. To correctly assume safety, the compiler would actually have to know that _every_ non-inline function (e.g. all of the startup code in libstdc++ and the iostream implementation) that got called prior to the loop running is well-behaved. –  Apr 05 '16 at 05:35
  • 1
    @Hurkyl: No, all that horrible stuff you imagine would be Undefined Behavior. – MSalters Apr 05 '16 at 09:22
40

Here's additional insight to expand on @deniss' answer, which correctly diagnosed the issue.

Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".

The main issue is that the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the logical AND operator as well as of the tightened memory model semantics introduced with C++11. The relevant clauses in the draft of the standard are

5.14 Logical AND operator [expr.log.and]

Unlike &, && guarantees left-to-right evaluation: the second operand is not evaluated if the first operand is false.
ISO C++14 Standard (draft N3797)

and for speculative reads

1.10 Multi-threaded executions and data races [intro.multithread]

23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]
ISO C++14 Standard (draft N3797)

My guess is that optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory, rather than special-casing, for each target processor, whether the speculative load could introduce a detectable data race for that target.

In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable, because modern processors have very sophisticated branch prediction and the misprediction rate is typically very low. However, the data here is random - this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles; considering that the CPU typically retires 2 instructions per cycle, this is equivalent to 20 to 40 instructions. If the prediction rate is only 50% (no better than random guessing), then every iteration carries an expected mispredict penalty equivalent to 10 to 20 instructions - HUGE.
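As a quick sanity check of this explanation (my addition, not part of the original answer), you can make the branch predictable by filling v with non-decreasing values instead of random ones and re-running the benchmark; a comment below reports exactly this effect with increasing data:

#include <numeric>
#include <vector>

// Fill the vector with 0, 1, 2, ... so that every v[i-1] < v[i] comparison is
// true and the short-circuit branch in f1 predicts almost perfectly.
void fill_predictable(std::vector<int>& v)
{
  std::iota(v.begin(), v.end(), 0);
}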

Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop, and only if the first condition is true. The compiler could therefore generate code for the loop without the short-circuit branch up until the last iteration, then use different code with the short-circuit branch for the last iteration. That would require the compiler to know that the data is random and branch prediction is useless, and therefore that the transformation is worth bothering with - compilers aren't that sophisticated yet.
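Here is a minimal source-level sketch (my illustration, reusing the question's global v) of the transformation that note describes - branch-free for every iteration except the last, where the short circuit is kept:

int f0_peeled()
{
  if (v.size() < 3)
    return 0;                                       // nothing to count

  int n = 0;
  std::size_t last = v.size() - 2;                  // index of the final iteration

  for (std::size_t i = 1; i < last; ++i)            // v[i+1] is always a valid read here
    n += (v[i-1] < v[i]) & (v[i] < v[i+1]);         // branch-free body

  if (v[last-1] < v[last] && v[last] < v[last+1])   // short circuit only on the last iteration
    ++n;

  return n;
}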

To avoid the conditional branch generated by the logical AND (&&), and to avoid loading the memory locations into local variables, we can change the logical AND operator into a bitwise AND (code snippet below); the result is almost 4x faster when the data is random:

int f2()
{
  int n = 0;

  for (int i = 1; i < v.size()-1; ++i) 
     n += (v[i-1] < v[i])  &  (v[i] < v[i+1]); // Bitwise AND

  return n;
}

Output

3.642443ms min
3.779982ms median
Result: 166634

3.725968ms min
3.870808ms median
Result: 166634

1.052786ms min
1.081085ms median
Result: 166634


done

The result on gcc 5.3 is 8x faster (live in Coliru here)

g++ --version
g++ -std=c++14  -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm  && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

3.761290ms min
4.025739ms median
Result: 166634

3.823133ms min
4.050742ms median
Result: 166634

0.459393ms min
0.505011ms median
Result: 166634


done

You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target; for x86 this is possible because of the SETcc instruction, which produces a one-byte result, 0 or 1, based on a condition in the EFLAGS register - the same condition that could feed a conditional branch, but without branching. In the generated code given by @deniss you can see setl, which sets the result to 1 if the "less than" condition, established by the preceding compare instruction, is met:

cmp     edx, edi       ; a < b ?
setl    r10b           ; r10b = a < b ? 1 : 0
mov     ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea     rsi, [rsi + 1] ; ++i
cmp     edi, ecx       ; b < c ?
setl    dl             ; dl = b < c ? 1 : 0
and     dl, r10b       ; dl &= r10b
movzx   edx, dl        ; edx = zero extended dl
add     eax, edx       ; n += edx
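As a tiny standalone experiment (my addition), you can feed a single comparison to a compiler explorer and watch it become cmp + setl instead of a branch:

// A comparison returned as a value typically compiles to cmp + setcc on x86
// at -O2, with no conditional jump.
bool less_than(int a, int b)
{
  return a < b;
}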
amdn
  • 2
    Fun fact: on an increasing data instead of random, `f0` and `f2` are slower with `clang` than `f1`: http://rextester.com/RXGKJ2560 –  Apr 05 '16 at 03:46
  • 2
    @deniss, yes, a correctly predicted branch that avoids doing extra work is a beautiful thing - as long as it predicts correctly the branch "disappears" in the CPU front end - if the prediction is *wrong* then the pipeline is flushed and the CPU restarts fetching - a huge hit :-) – amdn Apr 05 '16 at 03:50
  • @deniss feel free to take any material from this answer to update yours, yours correctly diagnosed the problem and you deserve the credit. – amdn Apr 05 '16 at 03:56
  • 2
    _Please don't accept this answer_ As you wish, but I like your answer better. It elaborates on `&&` behavior, which is the key to this problem. – Paul Jurczak Apr 05 '16 at 05:18
  • 2
    Note that the compiler "insisting" on short-circuiting is entirely correct and desirable (e.g. `someObj && someObj->isOk()`). If you don't want short-circuiting, use `&`. That's why the two separate operators are there in the first place. – Luaan Apr 05 '16 at 08:13
  • @Luaan, you are right, I will update the answer, thank you – amdn Apr 05 '16 at 08:21
  • 1
    This answer also suggests implicitly to try `if (v[i] < v[i+1] && v[i-1] < v[i])`. The compiler can now **prove** `v[i+1]` is unconditionally used. – MSalters Apr 05 '16 at 09:27
  • @MSalters, the problem then is that `v[0]` must not be referenced if `v[1] >= v[2]`, so the compiler must short circuit on the first iteration. – amdn Apr 05 '16 at 09:50
  • 1
    @amdn: No, there's no such restriction. `v[0]` exists and is not volatile, so a compiler can speculatively read it under the as-if rule. But it can't do such a speculative read on `v[v.size()]` aka `*v.end()`. – MSalters Apr 05 '16 at 09:58
  • @MSalters, the code here never touches v[v.size()] because of the way the for loop condition is written, `i < v.size()-1;` – amdn Apr 05 '16 at 10:00
  • There are 2 parts to the transformation. && -> & makes f1 equivalent to f0. n+=cond; instead of if(cond)++n; is cmove-like and lets the compiler vectorize (if it was cleverer, it could do that second change on its own). – Marc Glisse Apr 05 '16 at 10:00
  • @amdn: True, but the problem for the optimizer is that this is achieved through an unavoidable short-circuit on &&. By reversing the condition, it turns out that the short-circuit behavior is no longer observable, and thus the short-circuit can be optimized out in exchange for a non-observable read of v[0]. – MSalters Apr 05 '16 at 10:07
  • @MSalters, perhaps mentioning the 'as-if' rule was a mistake, in thinking more about this I think as it stands, with the left operand of && being `v[i-1] < v[i]` the compiler knows that elements `v[0]` to `v[v.size()-2]` (all but the last) will be referenced regardless of the values contained, but a reference to `v[v.size()-1]` depends on the values in the previous two elements, requiring the short-circuit branch. If the left operand of && were replaced as you suggest with `v[i] < v[i+1]` by symmetry the issue is now the reference to `v[0]`, which depends on the value of `v[1], v[2]`. – amdn Apr 05 '16 at 10:28
  • @amdn: It's not symmetrical. The vector will actually hold a non-volatile pointer to &v[0], which means there must be an element there. &v[0] cannot be a pointer to one-after-an-array because its being compared to v[1]. – MSalters Apr 05 '16 at 12:18
  • @MSalters, we know that according to the STL definition of std::vector elements with index in the range 0..size()-1 are valid to read, but the compiler implements the core language, and what it sees is the implementation of std::vector as a class template, it must follow the C++ core standard, and there is nothing in the standard that indicates a pointer may be dereferenced speculatively. – amdn Apr 05 '16 at 14:09
  • 1
    @amdn: It actually is in the core language. Arrays are contiguous (do not have holes) and you cannot do pointer math between unrelated objects. This means that merely doing `ptr+N` means that there are at least N consecutive array elements starting at `ptr`. That is the core reason for my suggested change: because I tell the compiler that `p[2]` exists, on penalty of UB, it then knows from the pointer arithmetic that the element p[0] also exists. The other way around doesn't work: the compiler cannot know that p[2] exists from p[0]. – MSalters Apr 05 '16 at 14:28
  • @MSalters: I tested your hypothesis about reversing the two operands of `&&` and from the timings (no disassembly) it seems much closer to `f0` than `f1` so this might indeed help the compiler remove the branch. – Matthieu M. Apr 05 '16 at 15:11
  • 1
    @MSalters: Just because you have a pointer to `&v[0]` does not mean that it is valid to speculatively load it. `v[0]` might have an indeterminate value, and it might happen that the actual bit pattern stored there is a signalling NaN that breaks your program if read into a floating-point register. Itanium goes to a lot of effort to delay trap effects during speculative loads; other architectures aren't so fancy. The C++ language requires short-circuiting for `&&`, which requires that when `v[1] > v[2]`, side effects produced by evaluating `v[0]` do not occur. Pointer math rules don't help. – Ben Voigt Apr 05 '16 at 15:32
  • @BenVoigt: Would a call to a dummy function `inline dummy_array_access(int n, int arr[static n]) {};` entitle a compiler to perform speculative reads of `arr[0]..arr[n-1]` at its leisure? Also, if a processor core doesn't have trap representations for a particular type, would a compiler be expected to refrain from speculative reads if the only way they could cause traps would be if code used mechanisms outside the C Standard to set them up [e.g. some processors have features to trap when particular (configurable) addresses are read, but would not normally trap on reads]. – supercat Apr 05 '16 at 15:58
  • 1
    @supercat: I'm not familiar with that syntax. Does it somehow prevent the argument type translation that makes `arr` actually a pointer not an array? In any case, having an lvalue does not guarantee that you can perform lvalue->rvalue conversion; it's perfectly legal to make an lvalue to an uninitialized object. On some architectures, speculative reads of integral types may never trap, but MSalters claimed that the language guaranteed it, and it certainly does not. – Ben Voigt Apr 05 '16 at 16:06
  • @BenVoigt: I believe the syntax guarantees that all locations from *arr to *(arr+N-1) are accessible, and the purpose is to allow speculative accesses. What the language guarantees is that the compiler will only perform speculative reads that cannot affect the observable behavior of the program; I would take that as implying that if an implementation specifies observable behaviors beyond those defined by the Standard, that implementation must abide by that requirement as well, but the fact that a behavior might be observable in ways not understood by the implementation does not... – supercat Apr 05 '16 at 16:24
  • @supercat: Oh I just noticed that your question refers to the C standard. This is a C++ question. That probably explains why you're using syntax that's illegal in C++. – Ben Voigt Apr 05 '16 at 16:26
  • ...make it "observable" for purposes of conformance. Sorry I lose track sometimes of which features have made it into which language, especially since some implementations of each language include features from the other as extensions. In any case, for platforms which don't natively support trap representations of a particular type, I would think that a single dummy read of arr[size-1] should suffice to indicate that arr[0] to arr[size-1] are all accessible on any platform which doesn't define an extended concept of "observable behavior". – supercat Apr 05 '16 at 16:28
7

f0 and f1 are semantically different.

x() && y() involves a short circuit in the case of x() being false, as we know. This means that if x() is false, then y() must not be evaluated.

This prevents prefetching of the data needed to evaluate y(), and (at least on clang) it causes the insertion of a conditional jump, which results in branch-predictor misses.
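As an aside (my addition, echoing a point made in the comments on the answer above), the short circuit is not just an optimizer nuisance - it is what makes patterns like the following safe:

struct Widget { bool ok; };

// w->ok must not be evaluated when w is null; && guarantees exactly that.
bool is_usable(const Widget* w)
{
    return w != nullptr && w->ok;
}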

Adding another 2 tests proves the point.

#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;

    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];

        if (a < b  &&  b < c)
            ++n;
    }

    return n;
}


int f1()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
        if (v[i-1] < v[i]  &&  v[i] < v[i+1])
            ++n;

    return n;
}

int f2()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        auto t1 = v[i-1] < v[i];
        auto t2 = v[i] < v[i+1];
        if (t1 && t2)
            ++n;
    }

    return n;
}

int f3()
{
    int n = 0;

    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
    }

    return n;
}



int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;

        volatile long long result = 0;
        vector<long long>  timings(N);

        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();

            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }

        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
    };

    mt19937                    generator   (31415);   // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);

    for (auto& e: v) 
        e = distribution(generator);

    benchmark(f0);
    benchmark(f1);
    benchmark(f2);
    benchmark(f3);

    cout << "\ndone\n";

    return 0;
}

results (apple clang, -O2):

1.233948ms min
1.320545ms median
Result: 166850

3.366751ms min
3.493069ms median
Result: 166850

1.261948ms min
1.361748ms median
Result: 166850

1.251434ms min
1.353653ms median
Result: 166850
Richard Hodges
  • 3
    Actually, since we're talking about optimizers, `x() && y()` only needs short-circuiting as far as observable behavior is concerned. it's quite common that the first few instructions of y can be interleaved with the last instructions of x. – MSalters Apr 05 '16 at 10:11
  • 1
    @MSalters Yes, looking at the assembler this does seem like a missed optimisation opportunity by clang. Since there's no risk of aliasing, a prefetch would be fine. clang misses that opportunity and inserts a `jge` instruction which is causing a bottleneck. – Richard Hodges Apr 05 '16 at 10:25
  • 1
    @MSalters: Also, it means that if there is no observable effect to evaluating `y()` then no branch is needed. – Matthieu M. Apr 05 '16 at 15:02
2

None of the answers so far have given a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the tag wiki. As always, it usually works best to profile with performance counters to find stalls.)

v[i-1] < v[i] is work we already did last iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that's a huge problem.

Manually hoisting the v.size() - 1 out of the loop's upper bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers in v, but somehow it still optimizes less well than when computing a size_t upper = v.size() - 1 outside the loop (f2() and f4()).

A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
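For illustration (my addition, not from the answer), the two loop-header styles being contrasted look like this; the first matches the OP's loops, the second matches f2()/f4() below:

#include <cstddef>
#include <vector>

// int counter compared against a size_t bound: the mixed signed/unsigned
// comparison the answer mentions.
int count_mixed(const std::vector<int>& v)
{
  int n = 0;
  for (int i = 1; i < v.size() - 1; ++i)
    n += v[i-1] < v[i];
  return n;
}

// Hoisted bound with a matching size_t counter, as f2()/f4() do.
int count_matched(const std::vector<int>& v)
{
  int n = 0;
  std::size_t upper = v.size() - 1;
  for (std::size_t i = 1; i < upper; ++i)
    n += v[i-1] < v[i];
  return n;
}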


Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.


Results on 64bit Ubuntu 15.10, on Core2 E6600 (Merom/Conroe microarchitecture).

clang++-3.8 -O3 -march=core2   |   g++ 5.2 -O3 -march=core2         | gcc 5.2 -O2 (default -mtune=generic)
f0    1.825ms min(1.858 med)   |   5.008ms min(5.048 med)           | 5.000 min(5.028 med)
f1    4.637ms min(4.673 med)   |   4.899ms min(4.952 med)           | 4.894 min(4.931 med)
f2    1.292ms min(1.323 med)   |   1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3    1.082ms min(1.117 med)   |   2.426ms min(2.458 med)           | 2.420 min(2.465 med)
f4    1.291ms min(1.341 med)   |   1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)

Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.


f0 and f1:

gcc 5.2: The OP's f0 and f1 both make branchy loops, and won't auto-vectorize. f0 only uses one branch, though, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit compare. So it's still doing both comparisons on every iteration.

clang 3.8: f0: only one load per iteration, but does both compares and ands them together. f1: both compares each iteration, one with a branch to preserve the C semantics. Two loads per iteration.


int f2() {
  int n = 0;
  size_t upper = v.size()-1;   // difference from f0: hoist upper bound and use size_t loop counter
  for (size_t i = 1; i < upper; ++i) {
    int a = v[i-1], b = v[i], c = v[i+1];
    if (a < b  &&  b < c)
      ++n;
  }
  return n;
}

gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 compare results. Also, after combining the results from the two pcmpgtd instructions, it compares them against a vector of all-zeros and then masks that. Zero is already the identity element for addition, so that's really silly.

clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.


int f4() {
  int n = 0;

  size_t upper = v.size()-1;
  for (size_t i = 1; i < upper; ++i) {
      int a = v[i-1], b = v[i], c = v[i+1];
      bool ab_lt = a < b;
      bool bc_lt = b < c;

      n += (ab_lt & bc_lt);  // some really minor code-gen differences from f2: auto-vectorizes to better code that runs slightly faster even for this large problem size
  }

  return n;
}
  • gcc 5.2 -O3: autovectorizes like f2, but without the extra pcmpeqd.
  • gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
  • clang -O3: about the same code as f2.

Attempt at compiler hand-holding

int f3() {
  int n = 0;
  int a = v[0], b = v[1];   // These happen before checking v.size, defeating the loop vectorizer or something
  bool ab_lt = a < b;

  size_t upper = v.size()-1;
  for (size_t i = 1; i < upper; ++i) {
      int c = v[i+1];       // only one load and compare inside the loop
      bool bc_lt = b < c;

      n += (ab_lt & bc_lt);

      ab_lt = bc_lt;
      a = b;                // unused inside the loop, only the compare result is needed
      b = c;
  }
  return n;
}
  • clang 3.8 -O3: Unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies).
    4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on core2 where movups unaligned loads are slow) is as fast as gcc's vectorized versions.

  • gcc 5.2 -O3: Fails to auto-vectorize. My theory on that is that accessing the array outside the loop confuses the auto-vectorizer. Maybe because we do it before checking v.size(), or maybe just in general.

    Compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where it's a huge problem (2 to 3 cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) It's the main reason why gcc's code runs 2.5x slower than clang's.

## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
    add     rcx, 1    # i,
    mov     edi, DWORD PTR [r10+rcx*4]        # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
    cmp     edi, r8d  # a, a                 # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from the last iteration, so this is really comparing c, b
    mov     r8d, edi  # a, a
    setg    sil     #, tmp124
    and     edx, esi  # D.111089, tmp124     # PARTIAL-REG STALL: reading esi after writing sil
    movzx   edx, dl                          # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
    add     eax, edx  # n, D.111085          # n += ...
    cmp     r9, rcx   # upper, i
    mov     edx, esi  # ab_lt, tmp124
    jne     .L31      #,
    ret

Peter Cordes