None of the answers so far have given a version of `f()` that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)
`v[i-1] < v[i]` is work we already did last iteration, when we evaluated `v[i] < v[i+1]`. In theory, helping the compiler grok that would let it optimize better (see `f3()`). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with `-mtune=core2` where that's a huge problem.
Manually hoisting the `v.size() - 1` out of the loop's upper-bound check seems to help. The OP's `f0` and `f1` don't actually re-compute `v.size()` from the start/end pointers in `v`, but somehow they still optimize less well than when computing a `size_t upper = v.size() - 1` outside the loop (`f2()` and `f4()`).
A separate issue is that using an `int` loop counter with a `size_t` upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
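A minimal sketch of that mismatch, assuming a loop shaped like the ones in the question (my construction, not the OP's code):

```cpp
#include <cstddef>
#include <vector>

extern std::vector<int> v;   // the global the f*() functions read

// Hypothetical illustration of the int-counter / size_t-bound mix.
// In `i < upper`, the signed int i is converted to size_t. If upper
// were greater than INT_MAX, ++i would overflow (undefined behavior)
// before i could ever reach it, so the compiler can't describe the
// trip count cleanly: as written, the loop is potentially infinite.
int count_mismatched() {
    int n = 0;
    size_t upper = v.size() - 1;
    for (int i = 1; i < upper; ++i)     // -Wsign-compare flags this
        n += v[i-1] < v[i] && v[i] < v[i+1];
    return n;
}
```

Using a `size_t` counter to match the bound, as `f2()` and `f4()` below do, avoids the question entirely.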
Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.
Results on 64-bit Ubuntu 15.10, on a Core2 E6600 (Merom/Conroe microarchitecture). Times are min (median) in milliseconds:

|      | clang++-3.8 `-O3 -march=core2` | g++ 5.2 `-O3 -march=core2` | gcc 5.2 `-O2` (default `-mtune=generic`) |
|------|-------------------------------|----------------------------|------------------------------------------|
| `f0` | 1.825 ms min (1.858 med) | 5.008 ms min (5.048 med) | 5.000 ms min (5.028 med) |
| `f1` | 4.637 ms min (4.673 med) | 4.899 ms min (4.952 med) | 4.894 ms min (4.931 med) |
| `f2` | 1.292 ms min (1.323 med) | 1.058 ms min (1.088 med) (autovec) | 4.888 ms min (4.912 med) |
| `f3` | 1.082 ms min (1.117 med) | 2.426 ms min (2.458 med) | 2.420 ms min (2.465 med) |
| `f4` | 1.291 ms min (1.341 med) | 1.022 ms min (1.052 med) (autovec) | 2.529 ms min (2.560 med) |
Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.
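For reference, a harness along these lines can produce min/median numbers like the table above. This is my sketch, not the actual benchmark code, and `bench` is a hypothetical helper; compile it together with the `f2()`/`f3()`/`f4()` definitions below:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

std::vector<int> v;   // the global that the f*() functions read
int f2();             // e.g. the f2() defined below

// Hypothetical helper: time many calls, report the min and median.
template <typename F>
void bench(const char* name, F fn) {
    std::vector<double> times_ms;
    for (int rep = 0; rep < 100; ++rep) {
        auto t0 = std::chrono::steady_clock::now();
        volatile int sink = fn();     // keep the call from being optimized away
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        times_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(times_ms.begin(), times_ms.end());
    std::printf("%s %.3fms min (%.3f med)\n", name,
                times_ms[0], times_ms[times_ms.size() / 2]);
}

int main() {
    v.resize(8 * 1024 * 1024);
    for (size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<int>(i * 2654435761u);   // arbitrary non-monotonic fill
    bench("f2", f2);
}
```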
`f0` and `f1`:
gcc 5.2: The OP's `f0` and `f1` both make branchy loops, and won't auto-vectorize. `f0` only uses one branch, though, with a weird `setl sil` / `cmp sil, 1` / `sbb eax, -1` sequence for the second half of the short-circuit compare. (`cmp sil, 1` sets CF exactly when the `setl` result was 0, and `sbb eax, -1` computes `eax + 1 - CF`, so `eax` is only incremented when the condition held.) So it's still doing both comparisons on every iteration.
clang 3.8: `f0`: only one load per iteration, but does both compares and `and`s them together. `f1`: does both compares each iteration, one of them with a branch to preserve the C short-circuit semantics. Two loads per iteration.
```cpp
int f2() {
    int n = 0;
    size_t upper = v.size()-1;   // difference from f0: hoist the upper bound and use a size_t loop counter
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        if (a < b && b < c)
            ++n;
    }
    return n;
}
```
gcc 5.2 `-O3`: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of four compare results. Also, after combining the results from two `pcmpgtd` instructions, it compares them against a vector of all-zeros and then masks that. Zero is already the identity element for addition, so that's really silly.
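For illustration, here is a hand-written SSE2 intrinsics sketch of that strategy (my approximation, not gcc's actual output, and without the redundant compare-against-zero; `count_triples_sse2` is a hypothetical name):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the three-offset-loads strategy gcc uses for f2.
// pcmpgtd (here via _mm_cmplt_epi32, which is the same compare with
// swapped operands) yields 0 or -1 per lane, so subtracting the mask
// from an accumulator adds 1 for each triple that matches.
int count_triples_sse2(const int* v, size_t size) {
    if (size < 3) return 0;
    __m128i counts = _mm_setzero_si128();
    size_t i = 1;
    for (; i + 4 <= size - 1; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i*)(v + i - 1));
        __m128i b = _mm_loadu_si128((const __m128i*)(v + i));
        __m128i c = _mm_loadu_si128((const __m128i*)(v + i + 1));
        __m128i both = _mm_and_si128(_mm_cmplt_epi32(a, b),
                                     _mm_cmplt_epi32(b, c));
        counts = _mm_sub_epi32(counts, both);   // subtract -1 == add 1
    }
    // horizontal sum of the four lane counters
    counts = _mm_add_epi32(counts, _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2)));
    counts = _mm_add_epi32(counts, _mm_shuffle_epi32(counts, _MM_SHUFFLE(2, 3, 0, 1)));
    int n = _mm_cvtsi128_si32(counts);
    for (; i < size - 1; ++i)                   // scalar cleanup for the tail
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);
    return n;
}
```

Note the three unaligned `movups` loads per vector of results: on Core2, where unaligned loads are slow, that's exactly the bottleneck mentioned above.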
clang 3.8 `-O3`: unrolls: every iteration does two loads, three cmp/setcc, two `and`s, and two `add`s.
```cpp
int f4() {
    int n = 0;
    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        bool ab_lt = a < b;
        bool bc_lt = b < c;
        // some really minor code-gen differences from f2: auto-vectorizes to
        // better code that runs slightly faster even for this large problem size
        n += (ab_lt & bc_lt);
    }
    return n;
}
```
- gcc 5.2 `-O3`: auto-vectorizes like `f2`, but without the extra `pcmpeqd`.
- gcc 5.2 `-O2`: didn't investigate why this is twice as fast as `f2`.
- clang `-O3`: about the same code as `f2`.
Attempt at compiler hand-holding:
```cpp
int f3() {
    int n = 0;
    int a = v[0], b = v[1];   // these loads happen before checking v.size(), defeating the loop vectorizer or something
    bool ab_lt = a < b;

    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int c = v[i+1];       // only one load and one compare inside the loop
        bool bc_lt = b < c;
        n += (ab_lt & bc_lt);
        ab_lt = bc_lt;
        a = b;                // dead inside the loop; only the compare result is needed
        b = c;
    }
    return n;
}
```
clang 3.8 `-O3`: unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies): 4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on Core2, where `movups` unaligned loads are slow) it's as fast as gcc's vectorized versions.
gcc 5.2 `-O3`: fails to auto-vectorize. My theory is that accessing the array outside the loop confuses the auto-vectorizer. Maybe it's because we do it before checking `v.size()`, or maybe it's a problem in general.

It compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one `and` per iteration. But gcc creates a partial-register stall, even with `-mtune=core2` where it's a huge problem (a 2-to-3-cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (`setcc` is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) That's the main reason why gcc's code runs 2.5x slower than clang's here.
```asm
## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
    add     rcx, 1                        # i,
    mov     edi, DWORD PTR [r10+rcx*4]    # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
    cmp     edi, r8d                      # a, a   # gcc's verbose-asm comments are a bit bogus here:
                                          #        # one of these `a`s is from the last iteration, so this is really comparing c, b
    mov     r8d, edi                      # a, a
    setg    sil                           #, tmp124
    and     edx, esi                      # D.111089, tmp124   # PARTIAL-REG STALL: reading esi after writing sil
    movzx   edx, dl                       # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
    add     eax, edx                      # n, D.111085   # n += ...
    cmp     r9, rcx                       # upper, i
    mov     edx, esi                      # ab_lt, tmp124
    jne     .L31                          #,
    ret
```