
I have two doubles, a and b, which are both in [0,1]. I want the min/max of a and b without branching for performance reasons.

Given that a and b are both positive, and below 1, is there an efficient way of getting the min/max of the two? Ideally, I want no branching.

Cody Gray - on strike
lee
  • What do you mean by "no branching"? – shayaan Mar 11 '19 at 19:46
  • 2
    sounds a little like premature optimization. Unless you already have something, measured, found out that this is a hot spot, simply use [`std::minmax`](https://en.cppreference.com/w/cpp/algorithm/minmax), I doubt that you can get anything significantly more efficient – 463035818_is_not_an_ai Mar 11 '19 at 19:47
  • You can use `std::min()` and `std::max()` or `std::minmax()`, but there's no guarantee that these functions are implemented without branching. – πάντα ῥεῖ Mar 11 '19 at 19:47
  • @user463035818 This is definitely the hot spot, shaving milliseconds here reduces overall runtime by hours. – lee Mar 11 '19 at 19:49
  • 2
    So, going back to basics, what have you tried so far that proved being not effective enough? – Fureeish Mar 11 '19 at 19:50
  • @jxh using `std::min` and `std::max` – lee Mar 11 '19 at 19:50
  • And, just to make sure, you **are** using optimisations, aren't you? In terms of compiler flags. And you **do** build it on Release, correct? – Fureeish Mar 11 '19 at 19:51
  • @Fureeish I'm using -O3 – lee Mar 11 '19 at 19:52
  • Doubtful you can shave milliseconds unless you are running on an ancient machine. – jxh Mar 11 '19 at 19:52
  • 1
    @SpentDeath _"using `std::min` and `std::max`"_ And what's the specific performance bottleneck you observed when doing so? – πάντα ῥεῖ Mar 11 '19 at 19:52
  • In a millisecond you can get a lot of mins and maxes; I think we are rather talking about micros, and I still doubt that you can see much improvement from a hand-rolled `minmax` as compared to `std::minmax` – 463035818_is_not_an_ai Mar 11 '19 at 19:52
  • I don't know your details, but I am certain that if, let's say, you can get something that is 110% better than `std::minmax` (that would be a huge improvement), then you would need to run the code for years to see savings in runtime of hours – 463035818_is_not_an_ai Mar 11 '19 at 19:54
  • 1
    It is likely not that `min` and `max` are inefficient, but that your algorithm requires you to call them too many times. You might be able to shave milliseconds by reworking your algorithm to not need to do so many comparisons. – jxh Mar 11 '19 at 19:56
  • I'm doing `100*100*1000*640` calls to `std::min`. I previously precomputed a `*2` operation, which brought my runtime down significantly (from 12 hrs to 8 hrs). I am taking the min/max of things that are strictly in [0,1]. That is a special restriction that `g++` and `std::min/max` don't know about, which I should be able to take advantage of. – lee Mar 11 '19 at 19:59
  • What other operations are done with the same frequency ? –  Mar 11 '19 at 20:05
  • @YvesDaoust none, hence my question – lee Mar 11 '19 at 20:06
  • How does this number of calls to `std::min` compare to the size of your data set? – jxh Mar 11 '19 at 20:06
  • @SpentDeath: do you mean that you essentially… do nothing with these values ? –  Mar 11 '19 at 20:13
  • @YvesDaoust The values are part of other computations, but I'm trying to optimize this specific portion. – lee Mar 11 '19 at 20:17
  • Ok, keep the info for yourself. –  Mar 11 '19 at 20:17
  • Look, I know my variables are in [0,1], and I want to know if there is something clever I can do to find the min/max of them faster than what `g++` or `std::min/max` or assembly can do because those are assuming full range. I was thinking of something like: `0.5*(fabs(a - b) + a + b)` but `fabs` is as fast as `min`. – lee Mar 11 '19 at 20:21
  • 1
    Have you looked into using intrinsic functions? My CPU is nearly 10 years old but still has access to the AVX extension, which can compute the min/max of 4 doubles in one instruction via `_mm256_min_pd` and `_mm256_max_pd`. – mukunda Mar 11 '19 at 20:25
  • 1
    That call number is not that high. If calls to min and max take 100 clocks (it is unlikely that any implementation uses that many clocks; it should be much lower – you have optimization on, don't you?), then it only takes ~3 minutes (considering a 3.5GHz machine). Btw., SSE/AVX has min and max instructions; you should check out whether your compilation uses them. Current compilers are not too good at emitting these instructions. – geza Mar 11 '19 at 20:26
  • The only thing your constraint buys is that the two most significant bits are 0. It's theoretically possible that doing `min` after casting to an equivalently sized unsigned int would be slightly faster because there is no special handling for the sign bit. But I doubt this will make a real difference on modern processors. – AShelly Mar 11 '19 at 21:39
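For reference, the `fabs` identity mentioned in one of the comments above does work algebraically: max(a, b) = 0.5·(|a − b| + a + b) and min(a, b) = 0.5·(a + b − |a − b|). A quick sketch of it (the function names here are illustrative, not from the question; note also that intermediate rounding can, in principle, perturb the result, which is one more reason the idea doesn't beat a plain comparison):

```cpp
#include <cmath>

// max(a, b) expressed with fabs instead of a comparison.
double MaxViaFabs(double a, double b)
{
    return 0.5 * (std::fabs(a - b) + a + b);
}

// min(a, b) expressed with fabs instead of a comparison.
double MinViaFabs(double a, double b)
{
    return 0.5 * (a + b - std::fabs(a - b));
}
```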

1 Answer


Yes, there is a way to calculate the maximum or minimum of two doubles without any branches. The C++ code to do so looks like this:

#include <algorithm>

double FindMinimum(double a, double b)
{
    return std::min(a, b);
}

double FindMaximum(double a, double b)
{
    return std::max(a, b);
}

I bet you've seen this before. In case you don't believe that this is branchless, check out the disassembly:

FindMinimum(double, double):
    minsd   xmm1, xmm0
    movapd  xmm0, xmm1
    ret

FindMaximum(double, double):
    maxsd   xmm1, xmm0
    movapd  xmm0, xmm1
    ret

That's what you get from all popular compilers targeting x86. The SSE2 instruction set is used, specifically the minsd/maxsd instructions, which branchlessly evaluate the minimum/maximum value of two double-precision floating-point values.
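If you want to request these instructions explicitly rather than trusting the optimizer, the same minsd/maxsd operations are exposed as compiler intrinsics. A sketch, assuming an x86 target with SSE2 (the function names here are mine, not standard):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Wraps the minsd instruction: returns the smaller of a and b.
double MinSSE2(double a, double b)
{
    return _mm_cvtsd_f64(_mm_min_sd(_mm_set_sd(a), _mm_set_sd(b)));
}

// Wraps the maxsd instruction: returns the larger of a and b.
double MaxSSE2(double a, double b)
{
    return _mm_cvtsd_f64(_mm_max_sd(_mm_set_sd(a), _mm_set_sd(b)));
}
```

Be aware that minsd/maxsd are not symmetric with respect to NaN: if an operand is NaN, the instruction returns the second source operand, so `MinSSE2(a, b)` and `MinSSE2(b, a)` can differ for NaN inputs.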

All 64-bit x86 processors support SSE2; it is required by the AMD64 extensions. Even most 32-bit x86 processors support SSE2, which was introduced back in 2000. You'd have to go back a long way to find a processor that didn't support SSE2. But what if you did? Well, even there, you get branchless code on most popular compilers:

FindMinimum(double, double):
    fld      QWORD PTR [esp + 12]
    fld      QWORD PTR [esp + 4]
    fucomi   st(1)
    fcmovnbe st(0), st(1)
    fstp     st(1)
    ret

FindMaximum(double, double):
    fld      QWORD PTR [esp + 4]
    fld      QWORD PTR [esp + 12]
    fucomi   st(1)
    fxch     st(1)
    fcmovnbe st(0), st(1)
    fstp     st(1)
    ret

The fucomi instruction performs a comparison, setting flags, and then the fcmovnbe instruction performs a conditional move, based on the value of those flags. This is all completely branchless, and relies on instructions introduced to the x86 ISA with the Pentium Pro back in 1995, supported on all x86 chips since the Pentium II.

The only compiler that won't generate branchless code here is MSVC, because it doesn't take advantage of the FCMOVxx instruction. Instead, you get:

double FindMinimum(double, double) PROC
    fld     QWORD PTR [a]
    fld     QWORD PTR [b]
    fcom    st(1)            ; compare "b" to "a"
    fnstsw  ax               ; transfer FPU status word to AX register
    test    ah, 5            ; check C0 and C2 flags
    jp      Alt
    fstp    st(1)            ; return "b"
    ret
Alt:
    fstp    st(0)            ; return "a"
    ret
double FindMinimum(double, double) ENDP

double FindMaximum(double, double) PROC
    fld     QWORD PTR [b]
    fld     QWORD PTR [a]
    fcom    st(1)            ; compare "b" to "a"
    fnstsw  ax               ; transfer FPU status word to AX register
    test    ah, 5            ; check C0 and C2 flags
    jp      Alt
    fstp    st(0)            ; return "b"
    ret
Alt:
    fstp    st(1)            ; return "a"
    ret
double FindMaximum(double, double) ENDP

Notice the branching JP instruction (jump if parity bit set). The FCOM instruction is used to do the comparison, which is part of the base x87 FPU instruction set. Unfortunately, this sets flags in the x87 FPU status word, so in order to branch on those flags, they need to be extracted. That's the purpose of the FNSTSW instruction, which stores the x87 FPU status word to the general-purpose AX register (it could also store to memory, but…why?). The code then TESTs the appropriate bits, and branches accordingly to ensure that the correct value is returned. In addition to the branch, retrieving the FPU status word is also relatively slow. This is why the Pentium Pro introduced the FCOMI instructions.

However, it is unlikely that you would be able to improve upon the speed of any of this code by using bit-twiddling operations to determine min/max. There are two basic reasons:

  1. The only compiler generating inefficient code is MSVC, and there's no good way to force it to generate the instructions you want it to. Although inline assembly is supported in MSVC for 32-bit x86 targets, it is a fool's errand when seeking performance improvements. I'll also quote myself:

    Inline assembly disrupts the optimizer in rather significant ways, so unless you're writing significant swaths of code in inline assembly, there is unlikely to be a substantial net performance gain. Furthermore, Microsoft's inline assembly syntax is extremely limited. It trades flexibility for simplicity in a big way. In particular, there is no way to specify input values, so you're stuck loading the input from memory into a register, and the caller is forced to spill the input from a register to memory in preparation. This creates a phenomenon I like to call "a whole lotta shufflin' goin' on", or for short, "slow code". You don't drop to inline assembly in cases where slow code is acceptable. Thus, it is always preferable (at least on MSVC) to figure out how to write C/C++ source code that persuades the compiler to emit the object code you want. Even if you can only get close to the ideal output, that's still considerably better than the penalty you pay for using inline assembly.

  2. In order to get access to the raw bits of a floating-point value, you'd have to do a domain transition, from floating-point to integer, and then back to floating-point. That's slow, especially without SSE2, because the only way to get a value from the x87 FPU to the general-purpose integer registers in the ALU is indirectly via memory.

If you wanted to pursue this strategy anyway (say, to benchmark it), you could take advantage of the fact that floating-point values are lexicographically ordered in terms of their IEEE 754 representations, except for the sign bit. Note that, like the C++ versions further down, these routines return a flag indicating which argument is the minimum/maximum, rather than the value itself. So, since you are assuming that both values are positive:

FindMinimumOfTwoPositiveDoubles(double a, double b):
    mov   rax, QWORD PTR [a]
    mov   rdx, QWORD PTR [b]
    sub   rax, rdx              ; subtract bitwise representation of the two values
    shr   rax, 63               ; isolate the sign bit to see if the result was negative
    ret

FindMaximumOfTwoPositiveDoubles(double a, double b):
    mov   rax, QWORD PTR [b]    ; \ reverse order of parameters
    mov   rdx, QWORD PTR [a]    ; /  for the SUB operation
    sub   rax, rdx
    shr   rax, 63
    ret

Or, to avoid inline assembly:

#include <cstdint>
#include <climits>
#include <cstring>

// Returns true when "a" is the smaller of the two (positive) values.
bool FindMinimumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    uint64_t aBits;
    uint64_t bBits;
    std::memcpy(&aBits, &a, sizeof(aBits));  // memcpy avoids strict-aliasing violations
    std::memcpy(&bBits, &b, sizeof(bBits));
    return ((aBits - bBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}

// Returns true when "a" is the larger of the two (positive) values.
bool FindMaximumOfTwoPositiveDoubles(double a, double b)
{
    static_assert(sizeof(a) == sizeof(uint64_t),
                  "A double must be the same size as a uint64_t for this bit manipulation to work.");
    uint64_t aBits;
    uint64_t bBits;
    std::memcpy(&aBits, &a, sizeof(aBits));
    std::memcpy(&bBits, &b, sizeof(bBits));
    return ((bBits - aBits) >> ((sizeof(uint64_t) * CHAR_BIT) - 1));
}

Note that there are severe caveats to this implementation. In particular, it will break if the two floating-point values have different signs, or if both values are negative. If both values are negative, then the code can be modified to flip their signs, do the comparison, and then return the opposite value. To handle the case where the two values have different signs, code can be added to check the sign bit.

    // ...

    // Enforce two's-complement lexicographic ordering by checking the sign bit.
    // (The shift amount must use an unsigned 64-bit constant; shifting a plain
    // int by 63 is undefined behavior.)
    if (aBits & (1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)))
    {
        aBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - aBits);
    }
    if (bBits & (1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)))
    {
        bBits = ((1ULL << ((sizeof(uint64_t) * CHAR_BIT) - 1)) - bBits);
    }

    // ...

Dealing with negative zero will also be a problem. IEEE 754 says that +0.0 is equal to −0.0, so your comparison function will have to decide if it wants to treat these values as different, or add special code to the comparison routines that ensures negative and positive zero are treated as equivalent.
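To see the problem concretely: +0.0 and −0.0 compare equal as doubles, yet their bit patterns differ, so the integer trick above distinguishes them. A small illustration (`BitsOf` is a helper name I made up for this sketch):

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw IEEE 754 bit pattern of a double.
uint64_t BitsOf(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));
    return bits;
}
```

Here `BitsOf(0.0)` is 0, while `BitsOf(-0.0)` has only the sign bit set (0x8000000000000000), even though `0.0 == -0.0` evaluates to true.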

Adding all of this special-case code will certainly reduce performance to the point that we will break even with a naïve floating-point comparison, and will very likely end up being slower.

Cody Gray - on strike
  • This is exactly what I wanted to know! Upon looking at the assembly code, it looks like `-O3` made as many optimizations as possible. I can batch my mins in groups of `50000*640`. Is that possible with SSE2? I don't know much about this stuff. Currently I've parallelized `50000*640` mins into 32 cores. -- The pains of scientific computing. – lee Mar 11 '19 at 23:36
  • Yeah, -O3 will get you loop unrolling and vectorization in GCC. Where did you arrive at these numbers, 50000*640? SSE2 uses registers that are 128 bits wide, so you can fit two double-precision floats or four single-precision floats in a single register. There are other instruction sets you can use, like AVX or AVX2. Depends on what your target CPU architecture supports. In scientific computing, you are often targeting something quite recent, not a general-purpose consumer system. Ask a question with as many details as possible. We've got a core group of awesome optimization wizards here. @spe – Cody Gray - on strike Mar 11 '19 at 23:40
  • I have 50000 vectors with a vector dimension of 640. These aren't your traditional vectors; they don't follow the same algebra as standard vectors, and they aren't associative or commutative. I've already done all the mathematical optimizations I can, rearranging my algebra to minimize floating-point operations. I've parallelized and cached things as much as I could, and eliminated as much branching as possible. This is the only remaining non-mathematical bottleneck. Due to the undesirable properties of my "vectors", I'm left with optimizing just `std::min`. – lee Mar 11 '19 at 23:51
  • Unfortunately, there is too much background information required, since all of this is unpublished work. My thesis is from a pure math standpoint, which is similar to fuzzy vectors via hyper rectangles, and I'm trying to experimentally verify some claims. – lee Mar 11 '19 at 23:56
  • 1
    It would be a lot easier to make suggestions if I could see some code. With a specialized format, you have a lot of room for optimizations. I was underestimating some of that when I read the question, and I wrote the answer to be a bit more general-purpose. But again, as I said, it really depends on what microarchitecture you're targeting. If you can target AVX as a baseline, you can do 4 insns (2 shuffles, 2 comparisons) to find the min of 4 values using standard FP ops. But with your special format, you might be able to do even better than that, comparing fewer than 64 bits. @spent – Cody Gray - on strike Mar 12 '19 at 00:51
  • FWIW, `std::min(b, a)` elides the `movapd` instruction in the first example, producing slightly "better" code in terms of instruction length. Presumably this is a consequence of the differing NaN propagation. – clstrfsck Feb 01 '22 at 12:43
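Regarding the batching discussion in these comments: with SSE2 alone (the baseline on x86-64) you can already process two doubles per instruction with `_mm_min_pd`; AVX doubles that again with `_mm256_min_pd`, as mentioned above. A sketch of the SSE2 version (the loop structure and names are mine, and the array length is assumed even for brevity):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Element-wise minimum of two arrays, two doubles per iteration.
// Assumes n is even; a scalar remainder loop would be needed otherwise.
void MinBatchSSE2(const double* a, const double* b, double* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 2)
    {
        __m128d va = _mm_loadu_pd(a + i);           // load two doubles from each input
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_min_pd(va, vb)); // one minpd handles both lanes
    }
}
```

The unaligned loads/stores (`_mm_loadu_pd`/`_mm_storeu_pd`) keep the sketch simple; aligned variants can be used if the buffers are 16-byte aligned.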