The question is too vague for a single answer.
You cannot determine "The Speed" from the source code alone.
The OP's question is akin to "What is the fastest brick?"
The speed at which some C code runs depends on the compiler used, the architecture used (operating system and hardware), and sometimes even the runtime environment used. (Certain environment variables control the way C library functions work, on many OSes. Almost all C library functions can be interposed by custom functions without recompiling in Linux, if the user so desires.)
This is not even specific to C, but common to all programming languages. Even the speed of native binary code executed by a processor tends to vary from one exact processor model to the next; the x86 processor family is a perfect example of this.
You can often tell if a brick is slow, either because certain features tend to always be slow, or because a brick is clearly inferior (algorithmically or otherwise) -- say, more complicated than necessary to perform its function.
Only if you define the environment precisely enough can you tell which brick is the fastest one.
However, it's easy to forget that there are a number of different types of bricks, too, from lego bricks to building materials to non-functioning electronic gadgets. In other words, there are several standards related to C: K&R C, C89, C99, C11, POSIX C standards, and even commonly available compiler built-ins and vector extensions.
Note: OP specified "power of two less than [argument]", which means that for unsigned integers,
    argument   result
       0         0     (invalid argument: every power of two is larger than 0)
       1         0     (a power of two less than 1 would need a negative
                        exponent, which an unsigned integer cannot represent)
       2         1
       3         2
       4         2
       5         4
       6         4
       7         4
       8         4
       9         8
and so on.
Many CPU architectures have an assembly instruction to find the highest bit set in a number, like AndreyT described in another answer to this question. To find the power of two that is smaller than the specified argument, check if the highest bit set alone is equal to the argument, and if so, return the next smaller power of two:
    #include <stdint.h>
    #include <stdlib.h>

    #if defined(__GNUC__)
    static int highest_bit_set(uint32_t value)
    {
        if (sizeof (unsigned int) == sizeof value)
            return 31 - __builtin_clz(value);
        else
        if (sizeof (unsigned long) == sizeof value)
            return 31 - __builtin_clzl(value);
        else
            exit(127); /* Weird architecture! */
    }
    #elif defined(_MSC_VER)
    #include <intrin.h>
    static int highest_bit_set(unsigned long value)
    {
        unsigned long result;
        if (_BitScanReverse(&result, value))
            return result;
        else
            return -1;
    }
    #endif
    uint32_t smaller_power_of_two(uint32_t value)
    {
        uint32_t result;
        if (!value)
            return 0;
        result = ((uint32_t)1) << highest_bit_set(value);
        if (result == value)
            return result >> 1;
        return result;
    }
GCC optimizes the sizeof conditionals away at compile time, but does not produce anywhere near optimal assembly for x86. For x86, the AT&T-syntax assembly function using the bsr instruction can be simplified to
    smaller_power_of_two:
            movl    4(%esp), %edx    # edx = value
            movl    $1, %eax
            cmpl    %eax, %edx
            jbe     .retzero         # value <= 1: result is 0
            bsrl    %edx, %ecx       # ecx = index of highest set bit
            sall    %cl, %eax        # eax = highest power of two <= value
            movl    %eax, %ecx
            shrl    %ecx             # ecx = eax / 2
            cmpl    %eax, %edx
            cmovbe  %ecx, %eax       # value == eax: use the halved value
            ret
    .retzero:
            xorl    %eax, %eax
            ret
which has two comparisons, one branch, and one conditional move. On a Cortex-M4 microcontroller, GCC 4.8 produces the following nice assembly code (preamble and ABI directives omitted for brevity):
    smaller_power_of_two:
            cbz     r0, .end1
            clz     r3, r0
            movs    r2, #1
            rsb     r3, r3, #31
            lsl     r3, r2, r3
            cmp     r3, r0
            it      ne
            movne   r0, r3
            beq     .end2
    .end1:
            bx      lr
    .end2:
            lsr     r0, r3, r2
            bx      lr
Random memory accesses tend to have a non-negligible cost, because cache is limited, and RAM accesses are slow compared to the arithmetic-logic capabilities of typical CPUs. In my experience, look-up tables on the x86 architecture tend to lead to slower code than computing the value with a few simple arithmetic operations (additions, subtractions, bitwise AND/OR/NOT/XOR). The cache misses often show up elsewhere, because the look-up table occupies cache that would otherwise hold other data. So benchmarking only the function that uses a look-up table does not yield a reliable estimate of real-world performance, only an unrealistic "if the world were perfect" -type optimum estimate.
Assuming a generic C compiler, with no compiler built-ins or inline assembly, the old value & (value - 1) trick that clears the least significant set bit is generally very fast:
    uint32_t smaller_power_of_two(uint32_t value)
    {
        const uint32_t original = value;
        uint32_t       result;
        if (!value)
            return (uint32_t)0;
        do {
            result = value;
            value &= value - (uint32_t)1;  /* clear the lowest set bit */
        } while (value);
        if (result == original)            /* exact power of two: halve */
            return result >> 1;
        return result;
    }
The loop above runs at least once, and at most as many times as there are bits set in the argument, with an assignment, a bitwise AND, and a subtraction (decrement) per iteration. I'd probably use the above version in my own code, unless I had further information on the architecture and compiler used.
On some architectures conditional branching is very slow, so a variant based on the next greater power of two might be faster:
    #include <stdint.h>

    uint32_t smaller_power_of_two(uint32_t value)
    {
        if (value > (uint32_t)2147483648UL)
            return (uint32_t)2147483648UL;
        else
        if (value < (uint32_t)2)
            return (uint32_t)0;
        value--;
        value |= value >> 1;
        value |= value >> 2;
        value |= value >> 4;
        value |= value >> 8;
        value |= value >> 16;
        return (value + (uint32_t)1) >> 1;
    }
Without the if clauses, this would only yield correct results for arguments between 1 and 2147483648, inclusive. Still, the shifts form a serial dependency chain, so this code does not pipeline well; on superscalar CPUs it will not utilize the hardware as well as it might.
If conditional branching (based on an integer comparison) is fast, then a binary search can be used. For 32-bit unsigned integers, at most six if conditions need to be evaluated for any possible input value to reach the correct result:
    uint32_t smaller_power_of_two(uint32_t value)
    {
        if (value > (uint32_t)65536UL) {
            if (value > (uint32_t)16777216UL) {
                if (value > (uint32_t)268435456UL) {
                    if (value > (uint32_t)1073741824UL) {
                        if (value > (uint32_t)2147483648UL) {
                            return (uint32_t)2147483648UL;
                        } else {
                            return (uint32_t)1073741824UL;
                        }
                    } else {
                        if (value > (uint32_t)536870912UL) {
                            return (uint32_t)536870912UL;
                        } else {
                            return (uint32_t)268435456UL;
                        }
                    }
                } else {
                    if (value > (uint32_t)67108864UL) {
                        if (value > (uint32_t)134217728UL) {
                            return (uint32_t)134217728UL;
                        } else {
                            return (uint32_t)67108864UL;
                        }
                    } else {
                        if (value > (uint32_t)33554432UL) {
                            return (uint32_t)33554432UL;
                        } else {
                            return (uint32_t)16777216UL;
                        }
                    }
                }
            } else {
                if (value > (uint32_t)1048576UL) {
                    if (value > (uint32_t)4194304UL) {
                        if (value > (uint32_t)8388608UL) {
                            return (uint32_t)8388608UL;
                        } else {
                            return (uint32_t)4194304UL;
                        }
                    } else {
                        if (value > (uint32_t)2097152UL) {
                            return (uint32_t)2097152UL;
                        } else {
                            return (uint32_t)1048576UL;
                        }
                    }
                } else {
                    if (value > (uint32_t)262144UL) {
                        if (value > (uint32_t)524288UL) {
                            return (uint32_t)524288UL;
                        } else {
                            return (uint32_t)262144UL;
                        }
                    } else {
                        if (value > (uint32_t)131072UL) {
                            return (uint32_t)131072UL;
                        } else {
                            return (uint32_t)65536UL;
                        }
                    }
                }
            }
        } else {
            if (value > (uint32_t)256U) {
                if (value > (uint32_t)4096U) {
                    if (value > (uint32_t)16384U) {
                        if (value > (uint32_t)32768U) {
                            return (uint32_t)32768U;
                        } else {
                            return (uint32_t)16384U;
                        }
                    } else {
                        if (value > (uint32_t)8192U) {
                            return (uint32_t)8192U;
                        } else {
                            return (uint32_t)4096U;
                        }
                    }
                } else {
                    if (value > (uint32_t)1024U) {
                        if (value > (uint32_t)2048U) {
                            return (uint32_t)2048U;
                        } else {
                            return (uint32_t)1024U;
                        }
                    } else {
                        if (value > (uint32_t)512U) {
                            return (uint32_t)512U;
                        } else {
                            return (uint32_t)256U;
                        }
                    }
                }
            } else {
                if (value > (uint32_t)16U) {
                    if (value > (uint32_t)64U) {
                        if (value > (uint32_t)128U) {
                            return (uint32_t)128U;
                        } else {
                            return (uint32_t)64U;
                        }
                    } else {
                        if (value > (uint32_t)32U) {
                            return (uint32_t)32U;
                        } else {
                            return (uint32_t)16U;
                        }
                    }
                } else {
                    if (value > (uint32_t)4U) {
                        if (value > (uint32_t)8U) {
                            return (uint32_t)8U;
                        } else {
                            return (uint32_t)4U;
                        }
                    } else {
                        if (value > (uint32_t)2U) {
                            return (uint32_t)2U;
                        } else {
                            if (value > (uint32_t)1U) {
                                return (uint32_t)1U;
                            } else {
                                return (uint32_t)0U;
                            }
                        }
                    }
                }
            }
        }
    }
Modern CPUs, with branch prediction and speculative execution, handle such a construct poorly. On 32-bit microcontrollers the highest-bit-set scanning version is often faster, and on 8- and 16-bit microcontrollers the comparisons are especially slow (as the values are actually made up of multiple native words). But who knows: maybe there is a hardware architecture with a fast conditional jump on a 32-bit immediate-to-register comparison; on such an architecture this could very well be the fastest version.
For more complicated functions with a large argument range but a small result range, a look-up table of the argument-range endpoints (optionally with a matching result array), searched with a binary search, is often a very fast choice. For the function at hand, I would not use this approach.
So, which one of these is the fastest? It depends.
The above examples are not the only possible ways, just some relatively fast ones. If the argument range were smaller or larger, I might have different examples. If the arguments were not integers but floating-point numbers, you could use frexp() to obtain the power of two; if the argument is itself an exact power of two, decrement the obtained exponent by one more. On architectures that use the IEEE-754 binary32 or binary64 formats, if the argument is stored in memory, you can obtain the power of two just by type-punning the argument to an unsigned integer and masking certain bits off it.
There is a reason certain libraries (Intel MKL, AMD Core Math Library) and even operating system kernels pick the best (fastest) version of a function at run time.