The question is too vague for a single answer.
You cannot determine "The Speed" from the source code alone.
The OP's question is akin to "What is the fastest brick?"
The speed at which some C code runs depends on the compiler used, the architecture used (operating system and hardware), and sometimes even the runtime environment used. (Certain environment variables control the way C library functions work, on many OSes. Almost all C library functions can be interposed by custom functions without recompiling in Linux, if the user so desires.)
This is not even specific to C, but common to all programming languages. Even the speed of native binary code executed by a processor tends to vary from one exact processor model to the next; the x86 processor family is a perfect example of this.
You can often tell if a brick is slow, either because certain features tend to always be slow, or because a brick is clearly inferior (algorithmically or otherwise) -- say, more complicated than necessary to perform its function.
Only if you define the environment precisely enough can you tell which brick is the fastest one.
However, it's easy to forget that there are a number of different types of bricks, too, from lego bricks to building materials to non-functioning electronic gadgets. In other words, there are several standards related to C: K&R C, C89, C99, C11, POSIX C standards, and even commonly available compiler built-ins and vector extensions.
Note: OP specified "power of two less than [argument]", which means that for unsigned integers,
    argument   result
       0         0     (invalid argument: every power of two is larger than 0)
       1         0     (a power of two less than 1 would need a negative
                        exponent, which an unsigned integer cannot represent)
       2         1
       3         2
       4         2
       5         4
       6         4
       7         4
       8         4
       9         8
and so on.
Many CPU architectures have an assembly instruction to find the highest bit set in a number, like AndreyT described in another answer to this question. To find the power of two that is smaller than the specified argument, check if the highest bit set alone is equal to the argument, and if so, return the next smaller power of two:
    #include <stdint.h>
    #include <stdlib.h>

    #if defined(__GNUC__)
    static int highest_bit_set(uint32_t value)
    {
        if (sizeof (unsigned int) == sizeof value)
            return 31 - __builtin_clz(value);
        else
        if (sizeof (unsigned long) == sizeof value)
            return 31 - __builtin_clzl(value);
        else
            exit(127); /* Weird architecture! */
    }
    #elif defined(_MSC_VER)
    #include <intrin.h>
    static int highest_bit_set(unsigned long value)
    {
        unsigned long result;
        if (_BitScanReverse(&result, value))
            return result;
        else
            return -1;
    }
    #endif
    uint32_t smaller_power_of_two(uint32_t value)
    {
        uint32_t result;
        if (!value)
            return 0;
        result = ((uint32_t)1) << highest_bit_set(value);
        if (result == value)
            return result >> 1;
        return result;
    }
GCC optimizes the sizeof conditionals away at compile time, but does not produce anywhere near optimal assembly for x86. For x86, the AT&T-syntax assembly function using the bsr instruction can be simplified to
    smaller_power_of_two:
            movl    4(%esp), %edx    # edx = value
            movl    $1, %eax
            cmpl    %eax, %edx
            jbe     .retzero         # value <= 1: result is 0
            bsrl    %edx, %ecx       # ecx = index of highest set bit
            sall    %cl, %eax        # eax = highest power of two <= value
            movl    %eax, %ecx
            shrl    %ecx             # ecx = eax / 2
            cmpl    %eax, %edx
            cmovbe  %ecx, %eax       # value == eax: use the halved value
            ret
    .retzero:
            xorl    %eax, %eax
            ret
which has two comparisons, one branch, and one conditional move. On a Cortex-M4 microcontroller, GCC 4.8 produces the following nice assembly code (preamble and ABI directives omitted for brevity):
    smaller_power_of_two:
            cbz     r0, .end1
            clz     r3, r0
            movs    r2, #1
            rsb     r3, r3, #31
            lsl     r3, r2, r3
            cmp     r3, r0
            it      ne
            movne   r0, r3
            beq     .end2
    .end1:
            bx      lr
    .end2:
            lsr     r0, r3, r2
            bx      lr
Random memory accesses tend to have a non-negligible cost, because cache is limited, and RAM accesses are slow compared to the arithmetic-logic capabilities of typical CPUs. In my experience, look-up tables on the x86 architecture tend to lead to slower code than computing the value with a few simple arithmetic operations (additions, subtractions, bitwise AND/OR/NOT/XOR). The cache misses often show up elsewhere, because the look-up table occupies cache that would otherwise hold other data. So benchmarking only the function that uses a look-up table does not yield a reliable estimate of real-world performance, only an unrealistic "if the world were perfect" -type optimum estimate.
Assuming a generic C compiler, with no compiler built-ins or inline assembly, the old value & (value - 1) trick that clears the least significant set bit is generally very fast:
    uint32_t smaller_power_of_two(uint32_t value)
    {
        const uint32_t original = value;
        uint32_t       result;
        if (!value)
            return (uint32_t)0;
        do {
            result = value;
            value &= value - (uint32_t)1;  /* clear the lowest set bit */
        } while (value);
        if (result == original)            /* exact power of two: halve */
            return result >> 1;
        return result;
    }
The loop above runs at least once, and at most as many times as there are bits set in the argument, with an assignment, a bitwise AND, and a subtraction (decrement) per iteration. I'd probably use the above version in my own code, unless I had further information on the architecture and compiler used.
On some architectures conditional branching is very slow, so a variant based on the next greater power of two might be faster:
    #include <stdint.h>

    uint32_t smaller_power_of_two(uint32_t value)
    {
        if (value > (uint32_t)2147483648UL)
            return (uint32_t)2147483648UL;
        else
        if (value < (uint32_t)2)
            return (uint32_t)0;
        value--;
        value |= value >> 1;
        value |= value >> 2;
        value |= value >> 4;
        value |= value >> 8;
        value |= value >> 16;
        return (value + (uint32_t)1) >> 1;
    }
Without the if clauses, this would only yield correct results for arguments between 1 and 2147483648, inclusive. Still, the shifts form a serial dependency chain, so this code does not pipeline well; on superscalar CPUs it will not utilize the hardware as well as it might.
If conditional branching (based on an integer comparison) is fast, then a binary search can be used. For 32-bit unsigned integers, at most six if conditions need to be evaluated for any possible input value to reach the correct result:
    uint32_t smaller_power_of_two(uint32_t value)
    {
        if (value > (uint32_t)65536UL) {
            if (value > (uint32_t)16777216UL) {
                if (value > (uint32_t)268435456UL) {
                    if (value > (uint32_t)1073741824UL) {
                        if (value > (uint32_t)2147483648UL) {
                            return (uint32_t)2147483648UL;
                        } else {
                            return (uint32_t)1073741824UL;
                        }
                    } else {
                        if (value > (uint32_t)536870912UL) {
                            return (uint32_t)536870912UL;
                        } else {
                            return (uint32_t)268435456UL;
                        }
                    }
                } else {
                    if (value > (uint32_t)67108864UL) {
                        if (value > (uint32_t)134217728UL) {
                            return (uint32_t)134217728UL;
                        } else {
                            return (uint32_t)67108864UL;
                        }
                    } else {
                        if (value > (uint32_t)33554432UL) {
                            return (uint32_t)33554432UL;
                        } else {
                            return (uint32_t)16777216UL;
                        }
                    }
                }
            } else {
                if (value > (uint32_t)1048576UL) {
                    if (value > (uint32_t)4194304UL) {
                        if (value > (uint32_t)8388608UL) {
                            return (uint32_t)8388608UL;
                        } else {
                            return (uint32_t)4194304UL;
                        }
                    } else {
                        if (value > (uint32_t)2097152UL) {
                            return (uint32_t)2097152UL;
                        } else {
                            return (uint32_t)1048576UL;
                        }
                    }
                } else {
                    if (value > (uint32_t)262144UL) {
                        if (value > (uint32_t)524288UL) {
                            return (uint32_t)524288UL;
                        } else {
                            return (uint32_t)262144UL;
                        }
                    } else {
                        if (value > (uint32_t)131072UL) {
                            return (uint32_t)131072UL;
                        } else {
                            return (uint32_t)65536UL;
                        }
                    }
                }
            }
        } else {
            if (value > (uint32_t)256U) {
                if (value > (uint32_t)4096U) {
                    if (value > (uint32_t)16384U) {
                        if (value > (uint32_t)32768U) {
                            return (uint32_t)32768U;
                        } else {
                            return (uint32_t)16384U;
                        }
                    } else {
                        if (value > (uint32_t)8192U) {
                            return (uint32_t)8192U;
                        } else {
                            return (uint32_t)4096U;
                        }
                    }
                } else {
                    if (value > (uint32_t)1024U) {
                        if (value > (uint32_t)2048U) {
                            return (uint32_t)2048U;
                        } else {
                            return (uint32_t)1024U;
                        }
                    } else {
                        if (value > (uint32_t)512U) {
                            return (uint32_t)512U;
                        } else {
                            return (uint32_t)256U;
                        }
                    }
                }
            } else {
                if (value > (uint32_t)16U) {
                    if (value > (uint32_t)64U) {
                        if (value > (uint32_t)128U) {
                            return (uint32_t)128U;
                        } else {
                            return (uint32_t)64U;
                        }
                    } else {
                        if (value > (uint32_t)32U) {
                            return (uint32_t)32U;
                        } else {
                            return (uint32_t)16U;
                        }
                    }
                } else {
                    if (value > (uint32_t)4U) {
                        if (value > (uint32_t)8U) {
                            return (uint32_t)8U;
                        } else {
                            return (uint32_t)4U;
                        }
                    } else {
                        if (value > (uint32_t)2U) {
                            return (uint32_t)2U;
                        } else {
                            if (value > (uint32_t)1U) {
                                return (uint32_t)1U;
                            } else {
                                return (uint32_t)0U;
                            }
                        }
                    }
                }
            }
        }
    }
Modern CPUs, with branch prediction and speculative execution, handle such a construct poorly. On 32-bit microcontrollers the highest-bit-set scanning version is often faster, and on 8- and 16-bit microcontrollers the comparisons are especially slow (as the values are actually made up of multiple native words). But who knows: maybe there is a hardware architecture with a fast conditional jump on a 32-bit immediate-to-register comparison; on such an architecture this could very well be the fastest version.
For more complicated functions with a large argument range but a small result range, a look-up table of the argument-range endpoints (optionally with a matching result array), searched with a binary search, is often a very fast choice. For the function at hand, I would not use this approach.
So, which one of these is the fastest? It depends.
The above examples are not the only possible ways, just some relatively fast ones. If the argument range were smaller or larger, I might have different examples. If the arguments were not integers but floating-point numbers, you could use frexp() to obtain the power of two; if the argument is itself an exact power of two, decrement the obtained exponent by one more. On architectures that use the IEEE-754 binary32 or binary64 formats, if the argument is stored in memory, you can obtain the power of two just by type-punning the argument to an unsigned integer and masking certain bits off it.
There is a reason certain libraries (Intel MKL, AMD Core Math Library) and even operating system kernels pick the best (fastest) version of a function at run time.