I'm curious if anyone could explain a branchless binary search implementation to me. I saw it mentioned in a recent question but I can't imagine how it would be implemented. I assume it could be useful to avoid branches if the number of items is quite large.
2 Answers
I'm going to assume you're talking about the sentence "Make a static const
array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it." found in this answer.
A "branchless" binary search is basically just an unrolled binary search loop. This only works if you know in advance the number of items in the array you're searching (as you would if it's static const
). You can write a program to write the unrolled code if it's too long to do by hand.
Then, you must benchmark your solution to see whether it really is faster than a loop. If your branchless code is too big, it won't fit inside the CPU's fast instruction cache and will take longer to run than the equivalent loop.

- 1
- 1

- 951,095
- 183
- 1,149
- 1,285
-
Ah okay I get it now thanks. I assumed there was some weird bit twiddling to avoid the if statements inside the loop. – GWW Feb 08 '11 at 19:01
-
+1 because this is indeed a useful way of removing loop branches, though from a later comment on that question it seems that R. intended the "twiddle bits to avoid conditional branches" meaning instead. By doing both you could in fact avoid *all* branches. – j_random_hacker Feb 09 '11 at 08:59
-
2Such "branchless" unrolled search is featured in Bentley's "Programming Pearls", IIRC, where he explains the technique thoroughly – Eli Bendersky May 09 '11 at 08:33
-
@GWW: I created an implementation of this branchless binary search, but it gets blown out of the water by the regular one. Perhaps I left some optimizations on the table? https://gist.github.com/3481295 – orlp Aug 26 '12 at 16:33
-
How would you go about benchmarking issues that touch upon code size? It sounds like only a real life implementation (full application) will paint the realistic picture. – bluss Jan 17 '17 at 15:52
-
@bluss: That's true. Benchmarking must always be done within an appropriate context. Usually you can get away with a small test program, but sometimes you need to do it within the context of the complete system. It all depends on your particular circumstances. – Greg Hewgill Jan 17 '17 at 16:38
If one has a function which returns +1, -1, or 0 based upon the position of the correct item versus the current one, one could initialize position to list size/2, and stepsize to position/2, and then after each comparison do position+=direction*stepsize; stepsize=stepsize/2. Iterate until stepsize is zero.

- 77,689
- 9
- 166
- 211
-
"Iterate until stepsize is zero." How is it branchless is you're doing a branch not zero? – rlibby Feb 08 '11 at 19:06
-
If one knows the starting value, one can unroll the loop the required number of times. Alternatively, some hardware platforms (especially DSP's) support a Fortran-style "DO" loop in hardware; when an instruction is fetched from an address matching the loop end address, the loop counter will decrement and the program counter will be loaded with the loop-start address. Zero cycles overhead for the loop. – supercat Feb 08 '11 at 19:14
-
computed branches are still branches. Loop unrolling is branchless (or at least it can be), but it doesn't use any conditions like "until stepsize is zero." – rlibby Feb 08 '11 at 21:27
-
@rlibby: DSP loops are not branches, since the execution unit sees a continuous stream of (non-branch) instructions with the instructions in the loop repeated an appropriate number of times. As for the terminating condition, if one knows the starting value, one can determine rather easily the maximum number of loops that may be required. Since iterating more times than necessary will be harmless, simply code for the maximum if desired. Since a binary search cannot be performed in O(1) time, it can only be branchless if coded for a particular maximum array size. – supercat Feb 08 '11 at 23:54