C++ isn't assembly language, and a compiler could compile your current function to branchless asm if it wanted to. (Dereferencing a struct pointer to load one member implies that the entire struct object is present and can thus be speculatively read without risk of faulting even if the C++ abstract machine wouldn't have touched y or z members.) What compiler(s) for what architecture(s) do you care about the most?
Have you tried compiling with profile-guided optimization, so the compiler can see that branches are unpredictable? This might lead it to do if-conversion of the if() into branchless cmov or whatever, depending on the target ISA. (Generate your random data with rand() & 0x7 or something so it's not rare for objects to have equal x and equal y, and to actually reach the z case.)
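For example, a test-data generator along these lines keeps the value range small enough that ties on x and y are common. (This is just a sketch of that suggestion; fill_random and its parameters are hypothetical, and Obj is the union defined further down in this answer.)

#include <stdlib.h>

void fill_random(Obj *arr, size_t n)     // hypothetical helper, not from the question
{
    for (size_t i = 0; i < n; ++i) {
        arr[i].x = rand() & 0x7;   // only 8 distinct values, so ties on x are frequent
        arr[i].y = rand() & 0x7;
        arr[i].z = rand() & 0x7;
    }
}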
It's possible to use SIMD to find the first non-matching element, then return the diff of that element. For example, x86 SIMD has a movemask operation that can turn a vector compare result into an integer bitmask, which we can use with a bit-scan instruction to find the first or last set bit.
(This depends on being able to safely read 16 bytes from your 12-byte struct, assuming x86. This is the case as long as your array doesn't end with the last element right at the end of a page, with the next page unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? generally yes, and is widely used for efficient implementations of strlen and similar functions.)
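If you can't guarantee that, a conservative guard might look something like the following untested sketch (the helper name and the 4 KiB page-size assumption are mine), falling back to a scalar compare for the rare unsafe case:

#include <stdint.h>

// true if a 16-byte load starting at p stays within the same 4 KiB page.
// (Conservative: crossing into the next page only faults if that page is unmapped.)
static inline bool overread_16B_is_safe(const void *p)
{
    return ((uintptr_t)p & 4095) <= 4096 - 16;
}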
(ARM NEON doesn't have a convenient movemask so for ARM / AArch64 you might be better off shuffling data around within a SIMD vector to come up with a result, if SIMD is a win at all. It might not be with ARM's predicated compare instructions, or with AArch64's more limited branchless conditional instructions that are still better than x86 CMOV.)
SIMD can give us good throughput but probably poor latency compared to @Scheff's branchless arithmetic version in comments, especially on a wide pipeline like modern x86 that can do lots of independent work in parallel (like turning separate compare results into boolean integers). High latency might not be ideal in a QSort where you expect branch mispredicts to not be rare; overlapping independent compares with out-of-order execution only works when branches are predicted correctly.
To get a + / 0 / - result from two int values, you can cast to int64_t and subtract. That avoids the possibility of signed overflow, and is efficient on 64-bit ISAs. (Or, if it can inline, ideally it can compile to just a 32-bit signed compare instead of an actual subtraction; 32-bit subtraction could have signed overflow, which is UB, and would lose the result on wrapping.) If you don't need to normalize to +1 / 0 / -1, do that.
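A minimal sketch of just that building block (the helper name is mine):

#include <stdint.h>

// <0, 0, or >0 depending on how a compares to b; the widening cast makes
// overflow impossible, unlike a plain int - int subtraction.
static inline int64_t cmp3way(int a, int b)
{
    return a - (int64_t)b;
}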
I used an anonymous struct inside a union with an array to extend @Scheff's handy benchmark framework (with bugfix), without changing everything from a->x to a->vals.x.
#include <stdint.h>
#include <immintrin.h>

union Obj {
    struct {            // extension: anonymous struct
        int x;
        int y;
        int z;
    };
    int elems[3];
};

// a better check would be on value ranges; sizeof can include padding
static_assert( sizeof(int64_t) > sizeof(int), "we need int smaller than int64_t");

int64_t compare_x86(const Obj *a, const Obj *b)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);  // assume over-read is safe: the last array object isn't at the end of a page.
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    __m128i veq = _mm_cmpeq_epi32(va, vb);

    unsigned eqmsk = _mm_movemask_ps(_mm_castsi128_ps(veq));
    unsigned diffmsk = ~eqmsk;       // now set bits mark elements that differ
    diffmsk |= 1<<2;                 // set elems[2]'s bit so we'll return that (non)diff if they're all equal
    unsigned firstdiff = __builtin_ctz(diffmsk);   // GNU C extension: count trailing zeros

    // sign-extend to 64-bit first so overflow is impossible, giving a +, 0, or - result
    return a->elems[firstdiff] - (int64_t)b->elems[firstdiff];
}
On Godbolt with GCC 9.3 -O3 -march=skylake -fno-tree-vectorize for x86-64, it compiles to this asm for the non-inline case:
compare_x86(Obj const*rdi, Obj const*rsi):
        vmovdqu   xmm1, XMMWORD PTR [rsi]
        vpcmpeqd  xmm0, xmm1, XMMWORD PTR [rdi]
        vmovmskps edx, xmm0            # edx = bitmask of the vector compare result
        not       edx                  # set bits now mark differing elements
        or        edx, 4
        tzcnt     edx, edx             # rdx = index of lowest set bit
        mov       edx, edx             # stupid compiler, already zero-extended to 64-bit
        movsx     rax, DWORD PTR [rdi+rdx*4]   # 32->64 sign extending load
        movsx     rdx, DWORD PTR [rsi+rdx*4]
        sub       rax, rdx             # return value in RAX
        ret
The latency critical path goes through the SIMD loads + compare, through movemask back to integer, the not/or (1 cycle each), tzcnt/bsf (3 cycles on Intel), then another L1d load-use latency for the movsx loads (5 cycles). (Numbers from https://agner.org/optimize/ and https://uops.info/; see also https://stackoverflow.com/tags/x86/info.) The scalar load addresses aren't known until after tzcnt, so there's very little ILP here.
Modern x86 can do 2 loads per clock so we are taking advantage of that. It can overlap nicely across independent compares, though, and the total uop count is low so the bottleneck on front-end bandwidth isn't too bad.
The unaligned SIMD loads have no penalty on Intel CPUs unless they cross a cache-line boundary; then latency is an extra 10 cycles or so, or worse if they cross a 4k boundary, especially on Intel before Skylake made page splits a lot cheaper. For random 4-byte-aligned object addresses, there are 3 out of 16 start positions that lead to a cache-line split load (for 64B cache lines): a 16-byte load starting at byte offset 52, 56, or 60 within a line spills into the next one. This further increases the average latency from the input addresses being ready to the compare result being ready, and can't overlap with any work.
Without -march=skylake, GCC uses a separate movdqu unaligned load, and rep bsf which is the same instruction as tzcnt. CPUs without BMI1 will decode it as plain bsf. (They differ only when the input is zero; we make sure that doesn't happen. bsf is slow on AMD, same speed as tzcnt on Intel.)
Using @Scheff's benchmark (which counts the results) on Godbolt, this is somewhat faster than the plain scalar "arithmetic" version when you disable auto-vectorization. (GCC can auto-vec the arithmetic version.) Timing results are inconsistent between runs because the test-case is too small and the AWS servers that compiler explorer runs on might have different CPU frequencies, although they're all Skylake-avx512. But within one run, alternating between this and arith, a result like this is typical:
compare_x86() 5. try: 28 mus (<: 3843, >: 3775)
compareArithm() 5. try: 59 mus (<: 4992, >: 5007)
compare_x86() 6. try: 39 mus (<: 3843, >: 3775)
compareArithm() 6. try: 64 mus (<: 4992, >: 5007)
compare_x86() 7. try: 27 mus (<: 3843, >: 3775)
compareArithm() 7. try: 64 mus (<: 4992, >: 5007)
But remember, this is just adding up the <0 and >0 return values, and thus is throughput bound, not latency bound. A new compare can start without any data dependency or control dependency on the previous compare result.
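If you wanted a latency-bound measurement instead, you'd have to feed each result into the next compare somehow. A rough sketch of that idea (mine, not part of @Scheff's benchmark; objs and n are hypothetical, and it uses compare_x86 and the includes from above):

#include <stddef.h>

int64_t sum_latency_chain(const Obj *objs, size_t n)
{
    size_t idx = 0;
    int64_t acc = 0;
    for (size_t i = 1; i < n; ++i) {            // same over-read caveat applies to objs[n-1]
        int64_t c = compare_x86(&objs[idx], &objs[i]);
        acc += c;
        idx = (c > 0) ? i : idx;                // next iteration's load address depends on this result
    }
    return acc;
}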
Hmm, I could have used pmovmskb to get the high bit of every byte, instead of every dword with the ps version, but C makes it inconvenient to use a byte offset into an int array instead of an element offset. In asm you'd tzcnt or BSF and then movsx rax, [rdi + rdx]. This might save a cycle of latency in bypass delay between SIMD-integer pcmpeqd and SIMD-FP movmskps. But to get that from a compiler you'd maybe have to cast to char* for the pointer addition then back to int*.
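Something like this untested sketch is what I mean (the compare_x86_bytemask name is mine; same Obj layout and over-read assumptions as above). Since pcmpeqd produces a whole-dword result, the 4 bits per element in the byte mask are identical, so ctz always lands on a byte offset that's a multiple of 4:

int64_t compare_x86_bytemask(const Obj *a, const Obj *b)
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);   // same over-read caveat as above
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    __m128i veq = _mm_cmpeq_epi32(va, vb);

    unsigned diffmsk = ~(unsigned)_mm_movemask_epi8(veq);  // one bit per byte; set = that element differs
    diffmsk |= 1u << 8;                          // byte offset of elems[2], forced like the dword version
    unsigned byteoff = __builtin_ctz(diffmsk);   // 0, 4, or 8

    const int *pa = (const int*)((const char*)a + byteoff);
    const int *pb = (const int*)((const char*)b + byteoff);
    return *pa - (int64_t)*pb;
}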
I thought at first of using _mm_cmpgt_epi32(va,vb) to get a vector of 0 / -1 compare results for signed greater-than, but then I realized that indexing the original structs would be just as easy as mapping the right element or bit of that into a -1 / +1 integer.
If you wanted to special-case the all-equal case, you might set bit #3 instead (|= 1<<3), then branch on that rare case but still do the rest branchlessly.
    diffmsk |= 1<<3;   // set the 4th bit so there's always a non-zero bit to find
    unsigned firstdiff = __builtin_ctz(diffmsk);

    if (firstdiff >= 3)   // handle this rare(?) case with a branch
        return 0;
    ... something with  (a < b) * 2 - 1
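One hedged way to finish that sketch (my addition, untested): pick out the differing element by index and map the greater-than result to +1 / -1, with the usual comparator sign convention (negative when *a sorts before *b).

    int av = a->elems[firstdiff];
    int bv = b->elems[firstdiff];
    return (av > bv) * 2 - 1;    // av != bv here, so this is +1 or -1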
Mixed branchy strategy:
If it's rare that the x values are equal, perhaps consider:
if (a->x != b->x)
    return a->x - (int_fast64_t)b->x;
else {
    // 8-byte branchless SIMD?
    // or maybe just 2-element branchless scalar
}
IDK if it's worth doing SIMD at all for only 2 more elements. Probably not.
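For example, a scalar version of that mixed strategy might look like this (my sketch, untested; whether the y/z part actually stays branchless depends on the compiler turning the ternary into cmov/csel):

int64_t compare_mixed(const Obj *a, const Obj *b)
{
    if (a->x != b->x)                    // branch on the hopefully-predictable common case
        return a->x - (int64_t)b->x;

    int64_t dy = a->y - (int64_t)b->y;   // widening subtraction: no overflow, sign is the answer
    int64_t dz = a->z - (int64_t)b->z;
    return dy ? dy : dz;                 // ideally a cmov, not another branch
}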
Or perhaps consider doing branchless for x and y, and branching on the y components being equal to skip scalar z? If your objects are random over most of the range of int, it's going to be rare that you find two that only differ in the last component.
I think the way good sorting algorithms do fewer comparisons by avoiding redundant comparisons probably creates more entropy in the pattern of results, and probably also increases the number of comparisons done between elements that are "close" to each other in the final sort order. So QSort could be doing more comparisons that do need to check y elements if there are many elements with equal x.