Optimizing std::clamp with favor for in-range input: is there a point for keeping cmov instead of a branch?

Question

C++17 std::clamp is a template function that makes sure the input value is not less than the given minimum and less than the given maximum, and returns the input value; otherwise it returns the minimum or the maximum respectively.

The goal is to optimize it, assuming the following:

The type parameter is 32 bit or 64 bit integer
The input value is way more likely to be already in range than out of range, so likely to be returned
The input value is likely to be computed shortly before, the minimum and maximum is likely to be known in advance
Let's ignore references, which may complicate optimization, but in practice are not useful for integer values

For both the standard implementation, and a naive implementation, the assembly generated by gcc and clang does not seem to favor the in range assumption above. Both of these:

#include <algorithm>

int clamp1(int v, int minv, int maxv)
{
    return std::clamp(v, minv, maxv);
}

int clamp2(int v, int minv, int maxv)
{
    if (maxv < v)
    {
        return maxv ;
    }
    if (v < minv)
    {
        return minv;
    }
    return v;
}

Compile into two cmov (https://godbolt.org/z/oedd9Yfro):

        mov     eax, esi
        cmp     edi, esi
        cmovge  eax, edi
        cmp     edx, edi
        cmovl   eax, edi

Trying to tell the compilers to favor the in range case with __builtin_expect (gcc seem to ignore C++20 [[likely]]):

int clamp3(int v, int minv, int maxv)
{
    if (__builtin_expect(maxv < v, 0))
    {
        return maxv;
    }
    if (__builtin_expect(v < minv, 0))
    {
        return minv;
    }
    return v;
}

The result for gcc and clang are now different (https://godbolt.org/z/s4vedo1br). gcc still fully avoids branches using two cmov. clang has one branch, instead of expected two (annotation mine):

clamp3(int, int, int):
        mov     eax, edx     ; result = maxv
        cmp     edi, esi     ;  v, minv
        cmovge  esi, edi     ; if (v >= minv) minv = v
        cmp     edx, edi     ;  maxv, v
        jl      .LBB0_2      ; if (maxv < v) goto LBB0_2
        mov     eax, esi     ; result = minv  (after it was updated from v if no clamping)
.LBB0_2:
        ret

Questions:

Are there significant disadvantages in using conditional jumps that are expected to go the same branch each time, so that gcc avoids them?
Is clang version with one conditional jump better than it would have been if there was two jumps?

Not using cmov is suggested in Intel® 64 and IA-32 Architectures Optimization Reference Manual, from June 2021 version page 3-5:

Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.

Not-taken branches are pretty cheap, although throughput of *taken* branches can suffer if there are too many branches (conditional or not) too close together (more than 1 in 16 bytes of machine code on recent Intel, for example.) See [discussion in comments](https://stackoverflow.com/questions/31280817/what-branch-misprediction-does-the-branch-target-buffer-detect/65796517#comment123986123_65796517) between myself and Noah. Of course, with the unlikely branch targets placed out-of-line, all the branches can be predicted-not-taken. — Peter Cordes, Dec 02 '21 at 09:15
BTW, if it's very likely that no clamping is needed, ideally the compiler can do `tmp = v - minv` (with well-defined wrapping) / `if ((unsigned)v >= (maxv-minv)){ figure out clamping; }` when compiling for a 2's complement machine like x86. That would require it to prove that maxv >= minv, or hoist a check for that out of a loop. I *think* the unsigned-compare range-check trick works in the general case for runtime-variable signed ranged if you have that guarantee. See [double condition checking in assembly](https://stackoverflow.com/q/5196527) — Peter Cordes, Dec 02 '21 at 09:41
Ideally for x86, you'd get `-minv` and `mavx-minv` (unsigned without wrapping) in registers so you could `lea ecx, [rax + rdx]` / `cmp ecx, esi` to get `v - minv` done in one copy-and-add instead of mov/sub. — Peter Cordes, Dec 02 '21 at 09:43
@PeterCordes, the branch in clang codegen is not taken (I've added my annotation to it), and it seems not a problem to arrange the code for both not taken branches, so looks like a missed optimization for both compilers, right? — Alex Guteniev, Dec 02 '21 at 09:43
https://godbolt.org/z/v9KeMs89c does the unsigned-range thing in C. I think it's safe for 2's complement C implementations where `unsigned` and `int` are the same bit-width. Critical path for `v` -> retval is just one `mov`, or with inlining probably nothing. But it takes more work for a single use; with modern CPUs (Broadwell and later, and all AMD) having single-cycle `cmov`, one cmov is probably fine on the critical path. (Especially if, as you say, the min and max limits are ready ahead of time, so it's no problem to couple them in as data dependencies for the result.) — Peter Cordes, Dec 03 '21 at 03:28
@PeterCordes, looks like the win for branchy version vanish if not using JCC errata workaround (`/QIntel-jcc-erratum` on MSVC, I wonder why it is not enabled by default when not tuning specifically for AMD64) — Alex Guteniev, Dec 03 '21 at 10:05
Yeah, there are lots of Skylake-derived CPUs around in the wild, and will be for many years, so that's something compilers should really be doing as part of their default tuning. — Peter Cordes, Dec 03 '21 at 10:08
@PeterCordes, any idea about using branches here in AMD? AMD manual does not say to avoid `cmov` for predictable branch. — Alex Guteniev, Dec 03 '21 at 10:32
AMD is identical to Intel Broadwell and later, except AMD probably runs `cmova` and `cmovbe` as single-uop. Those are still 2 uop on Intel, unlike the rest that are now 1 uop. (Intel put that point into their guide back when they still ran 3-input instructions like cmov and adc as 2 uops. But it's still somewhat true, even when the clamp limits are constants, although moreso for a ternary where both sides need computation.) — Peter Cordes, Dec 03 '21 at 10:55

Optimizing std::clamp with favor for in-range input: is there a point for keeping cmov instead of a branch?

0 Answers0