110

Summary:

I'm looking for the fastest way to calculate

(int) x / (int) y

without getting an exception for y==0. Instead I just want an arbitrary result.


Background:

When coding image processing algorithms I often need to divide by an (accumulated) alpha value. The most simple variant is plain C code with integer arithmetic. My problem is that I typically get a division by zero error for result pixels with alpha==0. However this are exactly the pixels where the result doesn't matter at all: I don't care about color values of pixels with alpha==0.


Details:

I'm looking for something like:

result = (y==0)? 0 : x/y;

or

result = x / MAX( y, 1 );

x and y are positive integers. The code is executed a huge number of times in a nested loop, so I'm looking for a way to get rid of the conditional branching.

When y does not exceed the byte range, I'm happy with the solution

unsigned char kill_zero_table[256] = { 1, 1, 2, 3, 4, 5, 6, 7, [...] 255 };
[...]
result = x / kill_zero_table[y];

But this obviously does not work well for bigger ranges.

I guess the final question is: Whats the fastest bit twiddling hack changing 0 to any other integer value, while leaving all other values unchanged?


Clarifications

I'm not 100% sure that branching is too expensive. However, different compilers are used, so I prefer benchmarking with little optimizations (which is indeed questionable).

For sure, compilers are great when it comes to bit twiddling, but I can't express the "don't care" result in C, so the compiler will never be able to use the full range of optimizations.

Code should be fully C compatible, main platforms are Linux 64 Bit with gcc & clang and MacOS.

A_Pointar
  • 121
  • 1
  • 12
philipp
  • 1,745
  • 1
  • 14
  • 25
  • 22
    How have you determined that the if-branch is too expensive? – djechlin May 27 '13 at 16:56
  • 7
    How have you determined that there **is** a branch? – leemes May 27 '13 at 16:57
  • 13
    +1 for profiling, with modern day branch prediction you might not need this. Also, **why** are you coding your own image processing algorithms? – TC1 May 27 '13 at 16:57
  • 1
    Did you try compiling with the highest optimization level your compiler has to offer? Compilers are *amazing* at arcane bit twiddling magic. – thejh May 27 '13 at 16:59
  • 2
    I think this is overkill, look here http://www.nynaeve.net/?p=178 compiler is smart enough – kassak May 27 '13 at 17:03
  • 1
    This is processor/OS specific. What is the plaform? – Tyler Durden May 27 '13 at 17:04
  • 8
    "Whats the fastest bit twiddling hack..." Maybe `y += !y`? No branch needed to compute that. You could compare `x / (y + !y)` against `x / max(y, 1)` and maybe also `y ? (x/y) : 0`. I guess there will no branch in either of them, at least with optimizations turned on. – leemes May 27 '13 at 17:07
  • 1
    how about (x+1)!/x!,makes 0 to 1,leaving all other numbers intact. – rjv May 27 '13 at 17:08
  • 2
    @RajeevS Do you really mean taking the factorial of `x` and `x+1`? – leemes May 27 '13 at 17:10
  • @leemes I know it would take a lot of time,but thought it would be a good thing to suggest. – rjv May 27 '13 at 17:11
  • 2
    Depending how often you get a by zero division, it might be most efficent to simply catch the `SIGFPE` signal with an empty handler. Have a look at http://stackoverflow.com/q/4747934/991425 for how you can do it. – Haatschii May 27 '13 at 17:13
  • 2
    Why downvote this? Are micro optimisation questions disliked here? – someguy May 27 '13 at 17:20
  • 2
    result = x / (y+1) introduces a small error and is very fast. – brian beuning May 27 '13 at 17:21
  • 1
    @brian beuning: A significant error in most cases, e.g. x=6, y=2 – someguy May 27 '13 at 17:24
  • @someguy Unnecessary / unprofiled -- yes. Unless you can really prove that this is the bottle neck, you should rely on the compiler / CPU. With branch prediction and assuming the 0 case is fairly rare, this shouldn't even be an issue. – TC1 May 27 '13 at 17:39
  • 1
    @TC1: While it's true it may be unnecessary in his case, this could be useful for someone else that does need it. If the problem was a lot more ad-hoc, I'd agree with you, but I think it's a fairly interesting question on its own. Also, you're right about profiling the code; I don't think the answers here are sufficient for that reason. I wouldn't penalize the person asking the question for it, though. – someguy May 27 '13 at 18:00
  • 6
    Anyone who thinks modern day branch prediction means you don't have to do this hasn't profiled enough branch-elimination code that runs at a per-pixel level. Modern day branch prediction is acceptable if the alpha `0` sections are huge and contiguous. There is a place for fiddling around with micro optimizations, and per-pixel operations is *exactly* that place. – Yakk - Adam Nevraumont May 27 '13 at 18:00
  • @philipp and others: I've added some elaboration on compiler acknowledged idiom and predicated execution. – Bryan Olivier May 27 '13 at 22:12
  • 1
    *"Instead I just want a random result."* -- I know what you mean (you don't care what the result is), but that's not the clearest way to express it. The word "random" is often used to mean "arbitrary", but it really means something much more specific. I'm sure `y == 0 ? rand() ? x / y` wouldn't satisfy your requirement. – Keith Thompson May 28 '13 at 21:47
  • @Keith: Changed that and a few more details – philipp May 29 '13 at 09:35
  • If you're coding image processing algorithms and you care about speed, you're optimizing at completely the wrong level. Rewrite your code to use vector ops (SSE/SSE2/...), and it will get much faster. For even more speed, use a GPU. – user9876 May 29 '13 at 15:59
  • 1
    @user9876: I already have an SSE4 implementation, and SSE does not have problems dividing by zero. This question is about the fallback code in plain C for older processors and image border areas where SSE isn't applicable. – philipp May 29 '13 at 16:13
  • if `result = (y==0)? 0 : x/y;` is used, and `y==0`, then `result=0`, but if `result = x / MAX( y, 1 );` is used, and `y==0`, then `result=x`, right? So any solution that lead to any of the implementatios would be good for you, right? – Guillermo Gutiérrez Jun 03 '13 at 19:31
  • In ARM64 `x/0` results in 0 so you just need a simple `x/y` without any conditionals – phuclv Mar 15 '17 at 01:48

4 Answers4

107

Inspired by some of the comments I got rid of the branch on my Pentium and gcc compiler using

int f (int x, int y)
{
        y += y == 0;
        return x/y;
}

The compiler basically recognizes that it can use a condition flag of the test in the addition.

As per request the assembly:

.globl f
    .type   f, @function
f:
    pushl   %ebp
    xorl    %eax, %eax
    movl    %esp, %ebp
    movl    12(%ebp), %edx
    testl   %edx, %edx
    sete    %al
    addl    %edx, %eax
    movl    8(%ebp), %edx
    movl    %eax, %ecx
    popl    %ebp
    movl    %edx, %eax
    sarl    $31, %edx
    idivl   %ecx
    ret

As this turned out to be such a popular question and answer, I'll elaborate a bit more. The above example is based on programming idiom that a compiler recognizes. In the above case a boolean expression is used in integral arithmetic and the use of condition flags are invented in hardware for this purpose. In general condition flags are only accessible in C through using idiom. That is why it so hard to make a portable multiple precision integer library in C without resorting to (inline) assembly. My guess is that most decent compilers will understand the above idiom.

Another way of avoiding branches, as also remarked in some of the above comments, is predicated execution. I therefore took philipp's first code and my code and ran it through the compiler from ARM and the GCC compiler for the ARM architecture, which features predicated execution. Both compilers avoid the branch in both samples of code:

Philipp's version with the ARM compiler:

f PROC
        CMP      r1,#0
        BNE      __aeabi_idivmod
        MOVEQ    r0,#0
        BX       lr

Philipp's version with GCC:

f:
        subs    r3, r1, #0
        str     lr, [sp, #-4]!
        moveq   r0, r3
        ldreq   pc, [sp], #4
        bl      __divsi3
        ldr     pc, [sp], #4

My code with the ARM compiler:

f PROC
        RSBS     r2,r1,#1
        MOVCC    r2,#0
        ADD      r1,r1,r2
        B        __aeabi_idivmod

My code with GCC:

f:
        str     lr, [sp, #-4]!
        cmp     r1, #0
        addeq   r1, r1, #1
        bl      __divsi3
        ldr     pc, [sp], #4

All versions still need a branch to the division routine, because this version of the ARM doesn't have hardware for a division, but the test for y == 0 is fully implemented through predicated execution.

Bryan Olivier
  • 5,207
  • 2
  • 16
  • 18
  • Could you show us the resulting assembler code? Or how did you determine that there is no branch? – Haatschii May 27 '13 at 17:15
  • 1
    Awesome. Can be made `constexpr` and avoid needless type casts like this: `template constexpr auto fdiv( T t, U u ) -> decltype(t/(u+!u)) { return t/(u+!u); }` And if you want `255`, `(lhs)/(rhs+!rhs) & -!rhs` – Yakk - Adam Nevraumont May 27 '13 at 18:08
  • 1
    @leemes but I did mean `|` not `&`. Ooops -- `( (lhs)/(rhs+!rhs) ) | -!rhs` should set your value to `0xFFFFFFF` if `rhs` is `0`, and `lhs/rhs` if `rhs!=0`. – Yakk - Adam Nevraumont May 27 '13 at 18:20
  • 1
    Great answer! I usually resort to assembly for these kinds of things, but that is always horrible to maintain (not to mention less portable ;) ). – Leo May 28 '13 at 07:35
  • Make sure you benchmark before actually using this -- branches which can be well predicted are virtually free in modern CPUs, so tricks to avoid branching might end up a detriment to performance. Especially where heavy ops like division are involved. +1 for the great answer. – Cory Nelson May 29 '13 at 15:59
  • @Cory Of course you are right and on for example the PowerPC the branch prediction can be set according to profiling information. But don't get me started about benchmarking, because that might lead to a stampede of hobby horses. – Bryan Olivier May 29 '13 at 16:11
  • Was it intentional that you showed the *un-optimized* version of the compiler's assembly output? With optimizations enabled, a compiler gets even smarter, and the code becomes a lot more readable. Note, also, that with conditional move instructions (supported by x86 since the Pentium Pro), some compilers will generate even more efficient code if you just write `return x / (y == 0 ? 1 : y);` – Cody Gray - on strike Apr 23 '17 at 09:42
21

Here are some concrete numbers, on Windows using GCC 4.7.2:

#include <stdio.h>
#include <stdlib.h>

int main()
{
  unsigned int result = 0;
  for (int n = -500000000; n != 500000000; n++)
  {
    int d = -1;
    for (int i = 0; i != ITERATIONS; i++)
      d &= rand();

#if CHECK == 0
    if (d == 0) result++;
#elif CHECK == 1
    result += n / d;
#elif CHECK == 2
    result += n / (d + !d);
#elif CHECK == 3
    result += d == 0 ? 0 : n / d;
#elif CHECK == 4
    result += d == 0 ? 1 : n / d;
#elif CHECK == 5
    if (d != 0) result += n / d;
#endif
  }
  printf("%u\n", result);
}

Note that I am intentionally not calling srand(), so that rand() always returns exactly the same results. Note also that -DCHECK=0 merely counts the zeroes, so that it is obvious how often appeared.

Now, compiling and timing it various ways:

$ for it in 0 1 2 3 4 5; do for ch in 0 1 2 3 4 5; do gcc test.cc -o test -O -DITERATIONS=$it -DCHECK=$ch && { time=`time ./test`; echo "Iterations $it, check $ch: exit status $?, output $time"; }; done; done

shows output that can be summarised in a table:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.612s | -        | -        | -         | -         | -
Check 2      | 0m0.612s | 0m6.527s | 0m9.718s | 0m13.464s | 0m18.422s | 0m22.871s
Check 3      | 0m0.616s | 0m5.601s | 0m8.954s | 0m13.211s | 0m19.579s | 0m25.389s
Check 4      | 0m0.611s | 0m5.570s | 0m9.030s | 0m13.544s | 0m19.393s | 0m25.081s
Check 5      | 0m0.612s | 0m5.627s | 0m9.322s | 0m14.218s | 0m19.576s | 0m25.443s

If zeroes are rare, the -DCHECK=2 version performs badly. As zeroes start appearing more, the -DCHECK=2 case starts performing significantly better. Out of the other options, there really isn't much difference.

For -O3, though, it is a different story:

Iterations → | 0        | 1        | 2        | 3         | 4         | 5
-------------+-------------------------------------------------------------------
Zeroes       | 0        | 1        | 133173   | 1593376   | 135245875 | 373728555
Check 1      | 0m0.646s | -        | -        | -         | -         | -
Check 2      | 0m0.654s | 0m5.670s | 0m9.905s | 0m14.238s | 0m17.520s | 0m22.101s
Check 3      | 0m0.647s | 0m5.611s | 0m9.085s | 0m13.626s | 0m18.679s | 0m25.513s
Check 4      | 0m0.649s | 0m5.381s | 0m9.117s | 0m13.692s | 0m18.878s | 0m25.354s
Check 5      | 0m0.649s | 0m6.178s | 0m9.032s | 0m13.783s | 0m18.593s | 0m25.377s

There, check 2 has no drawback compared the other checks, and it does keep the benefits as zeroes become more common.

You should really measure to see what happens with your compiler and your representative sample data, though.

  • 4
    Make 50% of the entries be `d=0` randomly, instead of making it almost always `d!=0`, and you'll see more branch prediction failures. Branch prediction is great if one branch is almost always followed, or if the following of one branch or the other is really clumpy... – Yakk - Adam Nevraumont May 27 '13 at 18:18
  • @Yakk The `d` iteration is the inner loop, so the `d == 0` cases are distributed evenly. And is making 50% of the cases `d == 0` realistic? –  May 27 '13 at 18:19
  • 2
    is making `0.002%` of the cases `d==0` realistic? They are distributed throughout, every 65000 iterations you hit your `d==0` case. While `50%` might not happen often, `10%` or `1%` could easily happen, or even `90%` or `99%`. The test as displayed only really tests "if you basically never, ever go down a branch, does branch prediction make removing the branch pointless?", to which the answer is "yes, but that isn't interesting". – Yakk - Adam Nevraumont May 27 '13 at 18:32
  • @Yakk Alright, that's a good point. I'll try to come up with some fairer test. –  May 27 '13 at 18:56
  • @Yakk Thanks, looks like you were right, and I've edited accordingly. –  May 27 '13 at 19:56
  • You do know that you are mostly timing the rand() function (which is a lot slower than any of the division methods)? – Joe May 27 '13 at 19:59
  • You need to grab a timer, strip the generation of data out of the calculation, then time the calculation only. Profiling can suck! – Yakk - Adam Nevraumont May 27 '13 at 20:07
  • @Joe Yes, I am aware of that. But how often rand() is called does not depend on `CHECK`, the *differences* in timing are purely due to the different possibilities for division. –  May 27 '13 at 20:12
  • 1
    No, because the differences will be effectively invisible due to the noise. – Joe May 27 '13 at 20:14
  • @Joe I consistently get the same results. Not down to the millisecond, of course, but enough to know that the three second difference is significant. –  May 27 '13 at 20:16
  • 3
    The distribution of zeros does not relate to the distribution found in the question asker's situation. Images containing a mix of 0 alpha and other have holes or irregular shape, but (usually) this is not noise. To assume you know nothing about the data (and consider it noise) is a mistake. This is a real world application with actual images which may have 0 alpha. And since a row of pixels is likely to have either all a=0 or all a>0, taking advantage of branch predication may very well be the fastest, especially when a=0 occurs a lot and (slow) divisions (15+ cycles!) are avoided. – DDS May 28 '13 at 19:28
  • @DDS: thanks, this is a point I haven't thought of. However, I've had this problem multiple times in the past with different algorithms, it's good to know the accepted answer. – philipp May 29 '13 at 17:22
13

Without knowing the platform there is no way to know the exact most efficient method, however, on a generic system this may close to the optimum (using Intel assembler syntax):

(assume divisor is in ecx and the dividend is in eax)

mov ebx, ecx
neg ebx
sbb ebx, ebx
add ecx, ebx
div eax, ecx

Four unbranched, single-cycle instructions plus the divide. The quotient will be in eax and the remainder will be in edx at the end. (This kind of shows why you don't want to send a compiler to do a man's job).

Tyler Durden
  • 11,156
  • 9
  • 64
  • 126
1

According to this link, you can just block the SIGFPE signal with sigaction() (I have not tried it myself, but I believe it should work).

This is the fastest possible approach if divide by zero errors are extremely rare: you only pay for the divisions by zero, not for the valid divisions, the normal execution path is not changed at all.

However, the OS will be involved in every exception that's ignored, which is expensive. I think, you should have at least a thousand good divisions per division by zero that you ignore. If exceptions are more frequent than that, you'll likely pay more by ignoring the exceptions than by checking every value before the division.

cmaster - reinstate monica
  • 38,891
  • 9
  • 62
  • 106