-1

I have a core function evaluating 4+ simple arithmetic comparisons to return a bool. This will be called O(N^2) times in a very large loop, with a single conditional branch based on the return.

If the function is written as:

    return x < y && g < h && m < n && q < r;

will it generate 3 conditional branches compared to the 0 using “&”? This code will be publicly released, and as such may be compiled on many different platforms with many different compiler implementations.

While a single implementation might be clever enough to optimize away short circuiting, is anything like this written into the standard (c++11, 14, 17, or 20)? Would it be “safer” for performance to just use “&”?

gmaggiol
  • 15
  • 3
  • 6
    Have you made a user defined `operator&&`? If not, short-circuting is mandated. From left to right, if one fails, the others aren't evaluated. – Ted Lyngmo Feb 21 '21 at 11:14
  • 4
    Have you measured and profiled to check if this could really by a top-two or three bottleneck in your program? If not then my recommendation is to not worry about such micro-optimizations. – Some programmer dude Feb 21 '21 at 11:18
  • just order it such that the left most of && is the most likely to be false. Nothing after the first false is evaluated since you know by then that the return value is false. Anything more clever than that would likely need to depend on what's going on to make the values. – Abel Feb 21 '21 at 13:01
  • 1
    For performance related concerns, **profile** the optimized code is the only way. Otherwise any performance related code modifications is speculation, and may pessimize the performance... because you just don't know, and optimizers are amazingly good. – Eljay Feb 21 '21 at 13:04

1 Answers1

0

I've put up 4 example of your code, with certain caveats on Godbolt.

  • no side effects
  • trivial types
#include <stdint.h>
#include <xmmintrin.h>

bool Test1(int x, int y, int g, int h, int m, int n, int q, int r) {
  return x < y && g < h && m < n && q < r;
}
bool Test2(int x, int y, int g, int h, int m, int n, int q, int r) {
  const bool a = x < y && g < h;
  const bool b = m < n && q < r;
  return a && b;
}
bool TestSIMD(__m128i v1, __m128i v2) {
  __m128i vcmp = _mm_cmplt_epi32(v1, v2);
  uint16_t mask = _mm_movemask_epi8(vcmp);
  return (mask == 0xffff);
}
bool Test4(int x, int y, int g, int h, int m, int n, int q, int r) {
  return x < y & g < h & m < n & q < r;
}

This compiles to

Test1(int, int, int, int, int, int, int, int):
        cmp     edi, esi
        setl    al
        cmp     edx, ecx
        setl    dl
        and     al, dl
        je      .L1
        cmp     r8d, r9d
        mov     ecx, DWORD PTR [rsp+16]
        setl    al
        cmp     DWORD PTR [rsp+8], ecx
        setl    dl
        and     eax, edx
.L1:
        ret
Test2(int, int, int, int, int, int, int, int):
        cmp     r8d, r9d
        mov     r10d, DWORD PTR [rsp+16]
        setl    al
        cmp     DWORD PTR [rsp+8], r10d
        setl    r8b
        and     eax, r8d
        cmp     edx, ecx
        setl    dl
        and     eax, edx
        cmp     edi, esi
        setl    dl
        and     eax, edx
        ret
TestSIMD(long long __vector(2), long long __vector(2)):
        vpcmpgtd        xmm1, xmm1, xmm0
        vpmovmskb       eax, xmm1
        cmp     ax, -1
        sete    al
        ret
Test4(int, int, int, int, int, int, int, int):
        cmp     r8d, r9d
        mov     r10d, DWORD PTR [rsp+16]
        setl    al
        cmp     DWORD PTR [rsp+8], r10d
        setl    r8b
        and     eax, r8d
        cmp     edx, ecx
        setl    dl
        and     eax, edx
        cmp     edi, esi
        setl    dl
        and     eax, edx
        ret

Cycles times is approximated as I haven't bothered with analysing every instruction.

  • The first case has a conditional branch, which can be bad if the branch is not correctly predicted every time. Takes 2(correct predict early out),3,4 or 2+12 (branch mis-predict). The compiler is taking liberties with short circuit as the data is trivial with no side effect.
  • The second case doesn't have branches but like the first takes 3 or 4 cycles. But the execution time should be the same every time.
  • The SIMD solution take 4 cycles, due to data dependency, without branches. But uses much less of the pipeline and therefore can overlap with more instructions. Also the data must be loadable into the registers or already present in the registers which cost at least one additional cycle.
  • The & solution also takes 4 cycles but also uses 13 instructions.

So if this is close to your problem use either SIMD if you can else use the fastest of Test2 and Test4 on your platform.

Surt
  • 15,501
  • 3
  • 23
  • 39
  • Thank you, this is a very helpful response! Unfortunately the data will be randomly uniformly distributed among the variables, so prediction will not be helpful. I’m surprised to see the compiler removes the first and third branch, since as the other commenter said, short circuiting is mandated. – gmaggiol Feb 21 '21 at 16:23
  • @gmaggiol they are mandatory, but!!! the compiler is in many cases allowed to do "as-if" and as there are no side effects it can just calculate all. – Surt Feb 21 '21 at 17:42