55

I need to find the smallest power of two that's greater than or equal to a given value. So far, I have this:

int value = 3221; // 3221 is just an example, could be any number
int result = 1;

while (result < value) result <<= 1;

It works fine, but feels kind of naive. Is there a better algorithm for that problem?


Related: Rounding up to next power of 2 has some C answers; C++20 std::bit_ceil() isn't available in C, so the ideas could be useful for older C++ code, too.

Most of the answers to this question predate C++20, but could still be useful if implementing a C++ standard library or compiler.

Also related: language-agnostic Given an integer, how do I find the next largest power of two using bit-twiddling? has a C++17 constexpr answer using GNU extensions.

Peter Cordes
Boyan

17 Answers

69

Here's my favorite. Other than the initial check for whether it's invalid (<0, which you could skip if you knew you'd only have >=0 numbers passed in), it has no loops or conditionals, and thus will outperform most other methods. This is similar to erickson's answer, but I think that my decrementing x at the beginning and adding 1 at the end is a little less awkward than his answer (and also avoids the conditional at the end).

/// Round up to next higher power of 2 (return x if it's already a power
/// of 2).
inline int
pow2roundup (int x)
{
    if (x < 0)
        return 0;
    --x;
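    // OR-smear: propagate the highest set bit into every lower bit, so x becomes 2^k - 1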
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x+1;
}

An answer on Given an integer, how do I find the next largest power of two using bit-twiddling? presents some explanation of how this common algorithm works, and examples of the bit-patterns for a couple of inputs. (That version uses unsigned, which allows avoiding the x<0 check and is generally better, as discussed in the comments.)

The same dec / shift/OR / inc strategy also appears in Hacker's Delight and on the Bit Twiddling Hacks page, as noted in the comments below.

Peter Cordes
Larry Gritz
  • 5
    Except for minor syntax differences and the addition of that initial check, yours is also nearly identical to the version given in Henry S. Warren, Jr.'s "Hacker's Delight". – Boojum Dec 13 '08 at 10:22
  • 1
    @Boojum: Thanks for mentioning that book. I've checked it out and it's got the solution I need, plus much more! – Boyan Dec 13 '08 at 10:36
  • Maybe this solution will be slower than doing it in Assembler, but it has the advantage of being completely portable. – Boyan Dec 13 '08 at 11:18
  • So was your solution, @Boyan, and I think yours was a little more readable. – paxdiablo Dec 13 '08 at 11:53
  • @Pax: Yes, but this one executes in constant time. And it's not that hard to understand, especially if you run it two-three times through the debugger :) – Boyan Dec 13 '08 at 12:34
  • 1
    Wouldn't "if (x <= 0)" be better than "if (x < 0)" ? – Dipstick Dec 13 '08 at 12:53
  • I agree, this is a better version of my approach (which I deleted). +1! – erickson Dec 13 '08 at 15:59
  • Well, I guess it's time to call it a day. Probably can't get more efficient than that solution. Thanks to all! – Boyan Dec 13 '08 at 19:14
  • 5
    @Boyan: This solution is not portable e.g., how does it work on x64? (it doesn't) – jfs Dec 13 '08 at 21:45
  • @J.F.: It works with a minor change. Could be done for both architectures with conditional compilation. – Boyan Dec 14 '08 at 05:05
  • 1
    A lookup table based approach can be up to twice as fast as this one. – Sparr Dec 17 '08 at 01:45
  • Actually, this algorithm resembles what a multiply instruction does. I wonder if there's a way to make use of that? – Mike Dunlavey Dec 17 '08 at 02:51
  • @Sparr But do you really want to store 4GB of data for a uint32_t? – fluffy Apr 01 '13 at 22:48
  • 1
    @fluffy I'm thinking of a table with 31 entries. – Sparr Apr 02 '13 at 16:19
  • @Sparr Ah. Is tbl[n] really faster than 1 << n? – fluffy Apr 02 '13 at 23:05
  • We probably need some rigorous benchmarking to find out. I reckon that in a real application, the call pattern of this function will bear out whether @Sparr's LUT will stay in cache long enough to keep it ahead. The bit shifting approach is just a simple series of instructions, while the LUT needs a good number fewer instructions but comes with a moderately sized data segment. The LUT's data segment (the LUT itself...) is larger than the bit shifting approach's instruction count. So, the cost that comes with that LUT is the segmentation of the information into lookup data and instructions. – Steven Lu Aug 09 '14 at 08:49
  • http://www.hackersdelight.org/hdcodetxt/clp2.c.txt – Quinn Taylor Jan 24 '15 at 03:35
  • Why `return 0`? 0 is not a power of 2, so I think you should return 1 instead (2^0). – Konstantin Chernov Feb 01 '16 at 10:53
  • 4
    In addition to Hacker's Delight, as mentioned by @Boojum, this solution is also found almost verbatim in Bit Twiddling Hacks (2001; http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2) where it's credited to Sean Anderson, and even earlier in a Usenet thread (https://groups.google.com/forum/#!topic/comp.lang.python/xNOq4N-RffU) from 1997 by Pete Hart and William Lewis. – Maxy-B Mar 15 '16 at 22:23
  • I am disappointed that no one has bothered to explain why this works. In short, any number x > 0 will have at least one non-zero bit. Let j be the index of the highest-order set bit in x. Then the shift operations set all bits at indices [0, j] to 1. Adding 1 then gives 2^(j+1). This is exactly what we want, except when x is already a power of 2. So we subtract one at the beginning. – xyz Jun 29 '20 at 00:52
  • 1
    Some problems here. It has undefined behaviour for zero input due to signed integer overflow. It also crucially depends on the size of an int. Both these problems can be solved simply by changing int to uint32_t. Although you could argue that the function should not actually be defined for x==0 or at least give the more correct answer of 1. – nattgris Sep 17 '21 at 08:59
  • @nattgris: For an input of `0`, the `x--` step produces `-1`. That's unsigned wraparound of the 2's complement bit-pattern, but it's not signed overflow. The right shifts are required to be implementation-defined for negative signed integers; sane implementations pick arithmetic right shift; it would also be legal to pick logical right shift. But on a 2's complement system the bit-pattern necessarily stays all-one because we OR with the original `-1`. Then `x++` goes from -1 to 0, in the middle of the signed range, nowhere near signed overflow. – Peter Cordes Feb 28 '23 at 04:14
  • @nattgris: There would be UB for `INT_MIN`, where `x--` overflows if it didn't take an early-out on negative `x`. So mostly the gain from using `uint32_t` is efficiency and simplicity, and value-range. And lack of dependency on 2's complement signed integers at least for the `0` input case. 100% agreed it's not good to use `int` here, any unsigned type would be better, but the problems don't include UB. – Peter Cordes Feb 28 '23 at 04:20
21
ceil(log2(value))

ilog2() can be calculated in 3 asm instructions e.g., http://www.asterisk.org/doxygen/1.4/log2comp_8h-source.html
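
For instance, a direct C++ translation might look like this (a sketch with a hypothetical name; it assumes value > 0, that the result fits in the type, and that log2() returns an exact integer for exact powers of two, which typical implementations do but the standard doesn't guarantee):

#include <cmath>

unsigned next_pow2_via_log(unsigned value)
{
    double exact = std::log2(static_cast<double>(value));
    return 1u << static_cast<unsigned>(std::ceil(exact));
}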

jfs
  • That's something I considered, but won't it be slower than my bit-pushing solution? – Boyan Dec 13 '08 at 08:20
  • I think this will be slower than the loop mentioned in the question – Naveen Dec 13 '08 at 08:24
  • 2
    There is nothing slow about this; taking the result of log2(value) and rounding it up to the nearest whole number cannot be beaten for efficiency when the log table lookups are all pre-computed. – Supernovah Dec 13 '08 at 08:36
  • 5
    Log-base-2 generally takes a float as an argument. Are you saying you have a lookup table with an entry for every possible float? I hope not... Of course, the fastest way is a lookup table with 2^32 entries but that's a bit memory-expensive. – paxdiablo Dec 13 '08 at 08:41
  • 1
    @J.F.: If we throw in some Assembler, your solution really looks better. Thanks for the link! – Boyan Dec 13 '08 at 09:24
  • 3
    The link is broken... – moala Aug 20 '16 at 12:39
  • 1
    Note it's only really 1 asm instruction; the 3 instructions for x86 are to handle the special case of `0` as an input, taking advantage of the fact that all real CPUs leave BSR's destination unmodified when the input is 0. (AMD documents this; Intel does it too but documents the result as an undefined value.) It generates `-1` with a funky code-size optimization for 32-bit mode, using `xor`-zeroing and `dec` instead of a more efficient `mov $-1, %eax` for 2 insns. Modern x86 can use `lzcnt`, but `lzcnt(0) == 32`, instead of the `-1` that this code wants. `1<<-1` on x86 is `1<<31`, because the hardware masks the shift count to 5 bits. – Peter Cordes Feb 28 '23 at 04:26
12

On Intel hardware the BSR instruction is close to what you want - it finds the most significant set bit. If you need to be more precise you can then check whether the remaining bits are all zero or not. I tend to assume that other CPUs will have something like BSR - this is a question you want answered to normalize a number. If your number is more than 32 bits then you would do a scan from your most significant DWORD to find the first DWORD with ANY bits set.

Edsger Dijkstra would likely remark that the above "algorithms" assume that your computer uses Binary Digits, while from his kind of lofty "algorithmic" perspective you should think about Turing machines or something - obviously I am of the more pragmatic style.
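
For concreteness, a minimal sketch of the BSR-based approach using MSVC's _BitScanReverse intrinsic (mentioned in the comments below); round_up_pow2 is a hypothetical name, and it assumes value > 0 and that the result fits in 32 bits:

#include <intrin.h>

unsigned round_up_pow2(unsigned value)
{
    unsigned long idx;
    _BitScanReverse(&idx, value);        // index of the most significant set bit
    unsigned msb = 1u << idx;
    return (value & ~msb) ? msb << 1     // other bits below the MSB are set: go one higher
                          : msb;         // value is already a power of two
}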

pngaz
  • Yes, maybe I'll do some Assembler. After finding the most significant set bit, I think I can do: if (~most_sign_bit & value) to find if I have to left-shift the value once. Is that correct? – Boyan Dec 13 '08 at 08:46
  • 2
    I looked in MSDN and there is a compiler intrinsic called _BitScanReverse() - this is better than assembler because you can't do inline assembler in x64 and you don't want to waste a procedure call to an external x64 routine. Assuming you're using MS compilers of course. – pngaz Dec 13 '08 at 08:52
  • The (~MSB & value ) sounds perfect - of course single-stepping will tell for sure ! – pngaz Dec 13 '08 at 08:54
  • There is a bit of cleanup to do since both 4 and 5 will return 2 for the MSB, while the right answers for those values are 4 and 8 respectively. Nevertheless, I like the BSR solution -- I tend to forget about that instruction. – DocMax Dec 13 '08 at 09:29
  • @DocMax: Yes, that's why I'll use (~MSB & value) after BSR to find out if I need one left shift. – Boyan Dec 13 '08 at 09:36
  • Ah, my mistake. I had misread the approach. I do not know what you are thinking for the check, but you can avoid a branch by using BSF like: mov eax,x; bsr ecx,eax; bsf edx,eax; sub edx,ecx; adc ecx,0; mov ebx,1; shl ebx,cl. (And assembly on a single line reads like a chess game. :) – DocMax Dec 13 '08 at 09:42
  • pow2roundup() above with the 8 votes solves exactly the problem posed. To create a normalized FP number you must still find the index of the MSB, so BSR does have a mission, but it's work you don't need if ALL you need is roundup - must admit I didn't know of pow2roundup() - clever! – pngaz Dec 14 '08 at 07:25
  • Google Query (( BSR XEON CLOCK )) has 3rd hit "Nearest power of 2 - GameDev.Net Discussion Forums". These guys benched BSR and a Flt-Pt-Trick, evidently such things are constant-time on XEON. _BitScanReverse() is not mentioned - inlining matters as the actual operation is so fast on such machines. – pngaz Dec 14 '08 at 08:46
12

In the spirit of Quake II's 0x5f3759df and the Bit Twiddling Hacks' IEEE version - this solution reaches into a double to extract the exponent as a means to calculate floor(lg2(n)). It's a bit faster than the accepted solution and much faster than the Bit Twiddling IEEE version since it avoids floating point math. As coded, it assumes a double is a real*8 IEEE float on a little endian machine.

int nextPow2(int n) 
{ 
    if ( n <= 1 ) return n;
    double d = n-1; 
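    // Type-punned read of the double's high 32 bits (little-endian IEEE 754):
    // the biased exponent minus 1022 equals floor(log2(d)) + 1 here.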
    return 1 << ((((int*)&d)[1]>>20)-1022); 
} 

Edit: Added an optimized x86 assembly version with the help of a co-worker. A 4% speed gain, but still about 50% slower than a bsr version (6 sec vs 4 on my laptop for n=1..2^31-2).

int nextPow2(int n) 
{ 
    if ( n <= 1 ) return n;
    double d;
    n--;
    __asm {
      fild    n 
      mov     eax,4
      fstp    d 
      mov     ecx, dword ptr d[eax]
      sar     ecx,14h 
      rol     eax,cl 
  }
} 
Tony Lee
  • Is this really efficient comparing to BSR instruction? – Explorer09 Jul 17 '21 at 22:31
  • @Explorer09: No, not remotely. The pure C version (with strict-aliasing violation) is just a way to take advantage of the leading-zero finding of int->FP conversion on a C implementation without intrinsics for bit-scan instructions like GNU C `__builtin_clz`. `bsr reg,reg` is 1 uop, 3c latency on Intel. https://uops.info/ (Or 6 uops with 4c latency, 4c throughput on AMD Zen 3 for example; `lzcnt` is fast on AMD and same speed on Intel, so prefer that if you can.) This is >6 single-uop instructions including store/reload latency, including several count outside the inline-asm block. – Peter Cordes Mar 02 '22 at 15:28
  • 1
    @Explorer09: If ISO C would stop sucking and provide portable functions for this stuff, we wouldn't have any reason to hack up horrible code like this. Platforms with an FPU but without integer bit-scan instructions might use something like this as the implementation as a worst case, but platforms like x86 could just use BSR. ISO C++20 *finally* got around to doing this with [`std::countl_zero(T)`](https://en.cppreference.com/w/cpp/numeric/countl_zero). – Peter Cordes Mar 02 '22 at 15:31
  • @Explorer09: So TL:DR: if you're going to use x86 inline asm at all, it's very silly to keep using the FPU instead of `mov ecx, n` / `bsr ecx, ecx` (not `bsr ecx, n`, that would have a [false dependency on the old value of `n`](https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter)), and adapt it to use the bit-index instead of the double exponent. Or better, use intrinsics for BSR and ROL to avoid inline asm entirely, especially MSVC style that forces spilling inputs to memory because the syntax sucks too much to have them in registers. – Peter Cordes Mar 03 '22 at 02:57
8

Here's a template version of the bit shifting technique.

#include <climits>   // CHAR_BIT

template<typename T> T next_power2(T value)
{
    --value;
    for(size_t i = 1; i < sizeof(T) * CHAR_BIT; i*=2)
        value |= value >> i;
    return value+1;
}

Since the loop only uses constants, it gets flattened by the compiler (I checked). The function is also future-proof.

Here's one that uses __builtin_clz. (Also future-proof.)

template<typename T> T next_power2(T value)
{
    return 1 << ((sizeof(T) * CHAR_BIT) - __builtin_clz(value-1));
}
Zacrath
  • If `T` is `unsigned long` or `unsigned long long`, you'll want `__builtin_clzll` and `1ULL << ...` – Peter Cordes Mar 02 '22 at 15:33
  • In C++20, use `std::bit_width` (index of MSB = x86 BSR = 32-clz. In fact, C++20 even has [`std::bit_ceil`](https://en.cppreference.com/w/cpp/numeric/bit_ceil) which is the answer to this whole question. (Its example implementation uses `T(1) << ...` to make sure we don't overflow shifting an `int` `1` before conversion to `T`.) – Peter Cordes Mar 03 '22 at 03:07
6

Your implementation is not naive, it's actually the logical one, except that it's wrong - it returns a negative number for values greater than half the maximum integer value.

Assuming you can restrict numbers to the range 0 through 2^30 (for 32-bit ints), it'll work just fine, and be a lot faster than any mathematical functions involving logarithms.

Unsigned ints would work better but you'd end up with an infinite loop (for numbers greater than 2^31) since you can never reach 2^32 with the << operator.
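
If you want to keep the loop but make that failure mode explicit, one option (my sketch, not paxdiablo's code, assuming 32-bit unsigned) is to use unsigned arithmetic and bail out at the largest representable power of two:

unsigned next_pow2_checked(unsigned value)
{
    const unsigned max_pow2 = 1u << 31;   // largest power of two in 32 bits
    unsigned result = 1;
    while (result < value) {
        if (result == max_pow2)
            return 0;                     // can't be rounded up within 32 bits
        result <<= 1;
    }
    return result;
}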

paxdiablo
  • Yes, the values would be greater than zero and much less than 2^31. – Boyan Dec 13 '08 at 08:27
  • 1
    Then what you have is about as fast as it's going to get. I don't doubt there's a boolean algebra solution that'll do it in 2 or three operations, but you'd sacrifice lots of readability for a very small performance gain. – paxdiablo Dec 13 '08 at 08:29
  • I think you might be able to speed it up using a binary search: Initialize result with the (statically) computed middle of the number range and shift left / right. The more steps you precompute on this search tree, the lower the average shift-amount becomes – Tetha Dec 13 '08 at 09:56
  • I just dont know if this really will be faster, because for many ranges the solution up there could be faster despite the higher amount of shifts. – Tetha Dec 13 '08 at 09:56
5

pow( 2, ceil( log2(value) ) );

log2(value) = log(value) / log(2);

Sorana
4

An exploration of the possible solutions to a closely related problem (that is, rounding down instead of up), many of which are significantly faster than the naive approach, is available on the Bit Twiddling Hacks page, an excellent resource for doing the kinds of optimization you are looking for. The fastest solution is to use a lookup table with 256 entries, which reduces the total operation count to around 7, from an average of 62 (by a similar operation-counting methodology) for the naive approach. Adapting those solutions to your problem is a matter of a single comparison and increment.
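
A sketch of that adaptation (my code, not taken verbatim from the page): build the 256-entry floor-log2 table once, do the byte-wise lookup, then a single compare decides whether to go one bit higher. It assumes 32-bit values, v > 0, a result that fits in 32 bits, and that init_log2_table() has been called once:

#include <stdint.h>

static unsigned char log2_table[256];   // log2_table[i] == floor(log2(i)) for i >= 1

static void init_log2_table(void)
{
    log2_table[1] = 0;
    for (int i = 2; i < 256; ++i)
        log2_table[i] = log2_table[i / 2] + 1;
}

static uint32_t next_pow2_lut(uint32_t v)
{
    unsigned t, lg;
    if ((t = v >> 24))      lg = 24 + log2_table[t];
    else if ((t = v >> 16)) lg = 16 + log2_table[t];
    else if ((t = v >> 8))  lg =  8 + log2_table[t];
    else                    lg =      log2_table[v];
    return (v & (v - 1)) ? (uint32_t)1 << (lg + 1)   // not a power of two: round up
                         : (uint32_t)1 << lg;        // already a power of two
}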

sudo make install
Sparr
  • Link is out-of-date. Newer link: http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup That gives a value of r = 0 to 31, then do `1 << r` to get the power of 2. – ToolmakerSteve Sep 06 '14 at 05:24
3

How about a recursive template version to generate a compile-time constant:

#include <cstdint>

template<uint32_t A, uint8_t B = 16>
struct Pow2RoundDown { enum{ value = Pow2RoundDown<(A | (A >> B)), B/2>::value }; };
template<uint32_t A>
struct Pow2RoundDown<A, 1> { enum{ value = (A | (A >> 1)) - ((A | (A >> 1)) >> 1) }; };

template<uint32_t A, uint8_t B = 16>
struct Pow2RoundUp { enum{ value = Pow2RoundUp<((B == 16 ? (A-1) : A) | ((B == 16 ? (A-1) : A) >> B)), B/2>::value }; };
template<uint32_t A >
struct Pow2RoundUp<A, 1> { enum{ value = ((A | (A >> 1)) + 1) }; };

Can be used like so:

Pow2RoundDown<3221>::value, Pow2RoundUp<3221>::value
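
For example, these compile-time checks should hold (assuming a C++11 compiler for static_assert; 3221 rounds down to 2^11 and up to 2^12):

static_assert(Pow2RoundDown<3221>::value == 2048, "round down");
static_assert(Pow2RoundUp<3221>::value   == 4096, "round up");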
duncan.forster
3

You don't really say what you mean by "better algorithm" but as the one you present is perfectly clear (if somewhat flawed), I'll assume you are after a more efficient algorithm.

Larry Gritz has given what is probably the most efficient C/C++ algorithm without the overhead of a lookup table, and it would suffice in most cases (see http://www.hackersdelight.org for similar algorithms).

As mentioned elsewhere, most CPUs these days have machine instructions to count the number of leading zeroes (or, equivalently, return the index of the most significant set bit); however, their use is non-portable and - in most cases - not worth the effort.

However most compilers have "intrinsic" functions that allow the use of machine instructions but in a more portable way.

Microsoft C++ has _BitScanReverse() and gcc provides __builtin_clz() to do the bulk of the work efficiently.
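
A minimal sketch of such a wrapper (a hypothetical helper; it assumes 32-bit unsigned values, an input of at least 2, and a result that fits in 32 bits) might dispatch on the compiler:

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline uint32_t next_pow2(uint32_t v)
{
#if defined(_MSC_VER)
    unsigned long msb;
    _BitScanReverse(&msb, v - 1);                        // index of highest set bit of v-1
    return (uint32_t)1 << (msb + 1);
#else
    return (uint32_t)1 << (32 - __builtin_clz(v - 1));   // GCC/Clang builtin
#endif
}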

Dipstick
2

In standard C++20, a function template in <bit> does this (documented on cppreference):

#include <bit>
unsigned long upper_power_of_two(unsigned long v)
{
    return std::bit_ceil(v);
}

Only unsigned integer types participate in overload resolution if you don't use the bit_ceil<T> template parameter explicitly.

Beware that bit_ceil has undefined behaviour if the result isn't representable in the input type, not just a garbage result.
This applies even for unsigned integer types, where arithmetic is well-defined to wrap.

For example, std::bit_ceil(-123) will implicitly convert that signed int input to unsigned, so it will operate on -123u, e.g. 0xffffff85u on a system with 32-bit int. The correct result would take 33 bits, more than the width of unsigned, so the behaviour is undefined.

This is true for negative inputs in general on 2's complement systems, except for INT_MIN / LONG_MIN etc., which have the same bit-pattern as 1u<<(n-1), i.e. 2**(n-1).
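
If the input might be near the top of the type's range, a caller can check before calling; a sketch (checked_bit_ceil is a hypothetical helper, and the threshold is the largest power of two representable in unsigned long):

#include <bit>
#include <limits>
#include <optional>

std::optional<unsigned long> checked_bit_ceil(unsigned long v)
{
    constexpr unsigned long max_pow2 =
        1ul << (std::numeric_limits<unsigned long>::digits - 1);
    if (v > max_pow2)
        return std::nullopt;   // rounding up would need one more bit than the type has
    return std::bit_ceil(v);
}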

Peter Cordes
Jake Schmidt
1

My version of the same:

int pwr2Test(size_t x) {
    return (x & (x - 1))? 0 : 1; 
}

size_t pwr2Floor(size_t x) {
    // A lookup table for rounding 4-bit numbers down to
    // the nearest power of 2.
    static const unsigned char pwr2lut[] = {
        0x00, 0x01, 0x02, 0x02,     //  0,  1,  2,  3
        0x04, 0x04, 0x04, 0x04,     //  4,  5,  6,  7
        0x08, 0x08, 0x08, 0x08,     //  8,  9, 10, 11
        0x08, 0x08, 0x08, 0x08      // 12, 13, 14, 15
    };

    size_t pwr2 = 0;                // The return value
    unsigned int i = 0;             // The nybble iterator

    for( i = 0; x != 0; ++i ) {     // Iterate through nybbles
        pwr2 = pwr2lut[x & 0x0f];   // rounding down to powers of 2.
        x >>= 4;                    // (i - 1) will contain the
    }                               // highest non-zero nybble index.

    i = i? (i - 1) : i;
    pwr2 <<= (i * 4);
    return pwr2; 
}

size_t pwr2Size(size_t x) {
    if( pwr2Test(x) ) { return x; }
    return pwr2Floor(x) * 2; 
 }
natersoz
1

The code below repeatedly strips the lowest bit off until the number is a power of two, then doubles the result unless the number is a power of two to begin with. It has the advantage of running in a time proportional to the number of bits set. Unfortunately, it has the disadvantage of requiring more instructions in almost all cases than either the code in the question or the assembly suggestions. I include it only for completeness.

int nextPow(int x) {
  int y = x;
  while (x &= (x ^ (~x + 1)))   // clear the lowest set bit each iteration
    y = x << 1;
  return y;
}
DocMax
1

I know this is downvote-bait, but if the number is small enough (like 8 or 16-bits) a direct lookup might be fastest.

// fill in the table
unsigned short tab[65536];
unsigned short bit = tab[i];

It might be possible to extend it to 32 bits by first doing the high word and then the low.

//
unsigned long bitHigh = ((unsigned long)tab[(unsigned short)(i >> 16)]) << 16;
unsigned long bitLow = 0;
if (bitHigh == 0){
    bitLow = tab[(unsigned short)(i & 0xffff)];
}
unsigned long answer = bitHigh | bitLow;

It's probably no better than the shift-or methods, but it could maybe be extended to larger word sizes.

(Actually, this gives the highest 1-bit. You'd have to shift it left by 1 to get the next higher power of 2.)
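
The table fill is left as an exercise above; one plausible way to populate it (my sketch, not the author's code) so that tab[i] holds the highest power of two not exceeding i:

void fill_tab(void)   // populates the tab[] declared above
{
    tab[0] = 0;
    for (unsigned i = 1; i < 65536; ++i)
        tab[i] = (i & (i - 1)) ? tab[i - 1]           // not a power of two: keep previous entry
                               : (unsigned short)i;   // i is a power of two: new highest 1-bit
}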

Mike Dunlavey
  • Using a 256-entry lookup table and using the results for all 4 bytes of a 4-byte word is quite viable. The memory/speed tradeoff to use a 65536 entry table is not great (14% speedup, 25500% memory increase), – Sparr Dec 18 '08 at 21:52
  • @Sparr. You're right. It's hard to beat the shift-or method, but it's interesting to keep an eye out for more ways to skin cats. – Mike Dunlavey Dec 18 '08 at 22:24
  • 1
    Once I heard a Dijkstra lecture on fun algorithms. He had a student who did an n-bit rotate by doing a sequence of 3 bit-reversals. – Mike Dunlavey Dec 18 '08 at 22:29
0

I love the shift.

I'll settle for

    int bufferPow = 1;
    while ( bufferPow<bufferSize && bufferPow>0) bufferPow <<= 1;

That way the loop always terminates, and the part after && almost never comes into play. And I do not think two lines are worth a function call. Also you can make it a long, or a short, depending on your judgment, and it is very readable. (If bufferPow becomes negative, hopefully your main code will exit fast.)

Usually you compute the power of two only once at the start of an algorithm, so optimizing would be silly anyway. However, I would be interested if anyone bored enough would care for a speed contest... using the above examples and 255 256 257 .. 4195 4196 4197

Kos Petoussis
0

An arbitrary log function can be converted to a log base 2 by dividing by the log of 2:

$ /usr/local/pypy-1.9/bin/pypy
Python 2.7.2 (341e1e3821ff, Jun 07 2012, 15:38:48)
[PyPy 1.9.0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``<arigato> yes but there is not
much sense if I explain all about today's greatest idea if tomorrow it's
completely outdated''
>>>> import math
>>>> print math.log(65535)/math.log(2)
15.9999779861
>>>> print math.log(65536)/math.log(2)
16.0
>>>>

It of course won't be 100% precise, since there is floating point arithmetic involved.

user1277476
-2

This works and is really fast (on my 2.66 GHz Intel Core 2 Duo 64-bit processor).

#include <iostream>
int main(void) {
    int testinput, counter = 0;   // counter must be initialized
    std::cin >> testinput;
    while (testinput > 1) {
        testinput = testinput >> 1;
        counter++;
    }
    int finalnum = testinput << (counter + 1);
    printf("Is %i\n",finalnum);
    return 0;
}

I tested it on 3, 6, and 65496, and correct answers (4, 8, and 65536) were given.

Sorry if this seems a bit arcane, I was under the influence of a couple of hours of Doom just before writing. :)