
I came up with three solutions so far:

The extremely inefficient standard library pow and log2 functions:

uint_fast16_t powlog(uint_fast16_t n)
{
  return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}

Far more efficient: computing successive powers of 2 until I exceed the target number:

uint_fast16_t multiply(uint_fast16_t n)
{
  uint_fast16_t maxpow = 1;
  while(2*maxpow <= n)
    maxpow *= 2;
  return maxpow;
}

The most efficient so far: binary searching a precomputed table of powers of 2:

uint_fast16_t binsearch(uint_fast16_t n)
{
  static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
    1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};

  return *(upper_bound(pows.begin(), pows.end(), n)-1);
}

Can this be optimized even more? Any tricks that could be used here?

Full benchmark I used:

#include <iostream>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <array>
#include <algorithm>
using namespace std;
using namespace chrono;

uint_fast16_t powlog(uint_fast16_t n)
{
  return static_cast<uint_fast16_t>(pow(2, floor(log2(n))));
}

uint_fast16_t multiply(uint_fast16_t n)
{
  uint_fast16_t maxpow = 1;
  while(2*maxpow <= n)
    maxpow *= 2;
  return maxpow;
}

uint_fast16_t binsearch(uint_fast16_t n)
{
  static array<uint_fast16_t, 20> pows {1,2,4,8,16,32,64,128,256,512,
    1024,2048,4096,8192,16384,32768,65536,131072,262144,524288};

  return *(upper_bound(pows.begin(), pows.end(), n)-1);
}

high_resolution_clock::duration test(uint_fast16_t(powfunct)(uint_fast16_t))
{
  auto tbegin = high_resolution_clock::now();
  volatile uint_fast16_t sink;
  for(uint_fast8_t i = 0; i < UINT8_MAX; ++i)
    for(uint_fast16_t n = 1; n <= 999999; ++n)
      sink = powfunct(n);
  auto tend = high_resolution_clock::now();
  return tend - tbegin;
}

int main()
{
  cout << "Pow and log took " << duration_cast<milliseconds>(test(powlog)).count() << " milliseconds." << endl;
  cout << "Multiplying by 2 took " << duration_cast<milliseconds>(test(multiply)).count() << " milliseconds." << endl;
  cout << "Binsearching precomputed table of powers took " << duration_cast<milliseconds>(test(binsearch)).count() << " milliseconds." << endl;
}

Compiled with -O2 this gave the following results on my laptop:

Pow and log took 19294 milliseconds.
Multiplying by 2 took 2756 milliseconds.
Binsearching precomputed table of powers took 2278 milliseconds.
  • What _kind_ of power of two do you need? A whole number like `2**3` or any (ir)rational like `2**(1/2)` or `2**0.1`? The most efficient method of generating _whole_ powers of two is bit shifting, so, you don't actually need `pow(2, integer)`, you can do `2 << integer` instead. – ForceBru Mar 04 '17 at 11:22
  • @ForceBru Whole numbers… just as shown in my benchmark –  Mar 04 '17 at 11:23
  • I think your benchmark is broken due to branch prediction. You should use random values for `n`. – YSC Mar 04 '17 at 11:33
  • find the most significant non-zero bit. you might possibly increase speed by first finding most significant non-zero byte. i had trouble understanding this question because this direct tactic doesn't seem to be among those you've tried. i also failed to see any rationale for the `volatile`. – Cheers and hth. - Alf Mar 04 '17 at 11:37
  • https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious – Mark Dickinson Mar 04 '17 at 11:39
  • @Cheersandhth.-Alf `volatile` is so that the calculations don’t get optimized out. –  Mar 04 '17 at 11:39
  • @MarkDickinson Isn’t this practically identical to dividing by two and then again calculating power? –  Mar 04 '17 at 11:41
  • did you mean to ask "how to efficiently **find** ..." (rather than "count")? – Walter Mar 04 '17 at 13:46
  • You know that the last elements of your table exceed the range of a 16 bit integer, right? – MikeMB Mar 04 '17 at 13:50
  • if you really need speed then i would consider just manually writing the binary search tree, i got 33% speed increase from the using of `fast_upper_bound` shown in an answer below. It's slower than bitshifting though. – Sopel Mar 04 '17 at 14:10
  • you can't "count" the highest power of 2 because *there's only one highest value* – phuclv Mar 04 '17 at 14:42
  • On the second don't do the calculation twice – paparazzo Mar 04 '17 at 15:06
  • @LưuVĩnhPhúc there are relevant techniques there, but it's about rounding up and here we round down – harold Mar 04 '17 at 15:27
  • @harold rounding down is just like rounding up and shift right by 1, or sometimes just a change in a constant – phuclv Mar 04 '17 at 15:34

6 Answers


Versions with intrinsics have already been suggested in the comments, so here's a version that does not rely on them:

uint32_t highestPowerOfTwoIn(uint32_t x)
{
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  return x ^ (x >> 1);
}

This works by first "smearing" the highest set bit to the right, and then x ^ (x >> 1) keeps only the bits that differ from the bit directly to their left (the msb is considered to have a 0 to its left). That is only the highest set bit, because thanks to the smearing the number is of the form 0…01…1: a run of zeroes followed by a run of ones.


Since no one is actually posting it, with intrinsics you could write (GCC, Clang)

uint32_t highestPowerOfTwoIn(uint32_t x)
{
  return 0x80000000 >> __builtin_clz(x);
}

Or (MSVC, probably, not tested)

uint32_t highestPowerOfTwoIn(uint32_t x)
{
  unsigned long index;
  // ignoring return value, assume x != 0
  _BitScanReverse(&index, x);
  return 1u << index;
}

Which, when directly supported by the target hardware, should be better.

Results on coliru, and latency results on coliru (compare with the baseline too, which should be roughly indicative of the overhead). In the latency result, the first version of highestPowerOfTwoIn doesn't look so good anymore (still OK, but it is a long chain of dependent instructions so it's not a big surprise that it widens the gap with the intrinsics version). Which one of these is the most relevant comparison depends on your actual usage.


If you have some odd hardware with a fast bit-reversal operation (but maybe slow shifts or slow clz), let's call it _rbit, then you can do

uint32_t highestPowerOfTwoIn(uint32_t x)
{
  x = _rbit(x);
  return _rbit(x & -x);
}

This is of course based on the old x & -x trick, which isolates the lowest set bit; surrounded by bit reversals, it isolates the highest set bit instead.

harold

The lookup table looks like the best option here. Hence, to answer

Can this be optimized even more? Any tricks that could be used here?

Yes we can! Let us beat the standard library binary search!

template <class T>
inline size_t
choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
    return b >= a ? src2 : src1;
}
template <class Container>
inline typename Container::const_iterator
fast_upper_bound(Container const& cont, typename Container::value_type const& value)
{
    auto size = cont.size();
    size_t low = 0;

    while (size > 0) {
        size_t half = size / 2;
        size_t other_half = size - half;
        size_t probe = low + half;
        size_t other_low = low + other_half;
        auto v = cont[probe];
        size = half;
        low = choose(v, value, low, other_low);
    }

    return begin(cont)+low;
}

Using this implementation of upper_bound gives me a substantial improvement:

g++ -std=c++14 -O2 -Wall -Wno-unused-but-set-variable -Werror main.cpp && ./a.out
Pow and log took 2536 milliseconds.
Multiplying by 2 took 320 milliseconds.
Binsearching precomputed table of powers took 349 milliseconds.
Binsearching (opti) precomputed table of powers took 167 milliseconds.

(live on coliru) Note that I've improved your benchmark to use random values; by doing so I removed the branch prediction bias.


Now, if you really need to push harder, you can optimize the choose function with x86_64 asm for clang:

template <class T>
inline size_t
choose(T const& a, T const& b, size_t const& src1, size_t const& src2)
{
#if defined(__clang__) && defined(__x86_64)
    size_t res = src1;
    asm("cmpq %1, %2; cmovaeq %4, %0"
        : "=q" (res)
        : "q" (a), "q" (b), "q" (src1), "q" (src2), "0" (res)
        : "cc");
    return res;
#else
    return b >= a ? src2 : src1;
#endif
}

With output:

clang++ -std=c++14 -O2 -Wall -Wno-unused-variable -Wno-missing-braces -Werror main.cpp && ./a.out
Pow and log took 1408 milliseconds.
Multiplying by 2 took 351 milliseconds.
Binsearching precomputed table of powers took 359 milliseconds.
Binsearching (opti) precomputed table of powers took 153 milliseconds.

(Live on coliru)

YSC
  • If you're going to use inline asm anyway, why not go for BSR? – harold Mar 04 '17 at 13:00
  • @harold I didn't even think about it >_ – YSC Mar 04 '17 at 13:02
  • Well I don't know how to write it in GCC's syntax, but you can scan for the highest set bit in the input and then left-shift 1 by that amount, right? – harold Mar 04 '17 at 13:03
  • i got ~33% speed increase from `fast_upper_bound` (with random input) by manually writing the binary search tree (which was tedious, but speed is the most important thing here) – Sopel Mar 04 '17 at 14:08
  • @Sopel I thought about doing so, but was discouraged before even starting it. – YSC Mar 04 '17 at 14:24
  • @YSC i was just curious how much it would give, because there is apparent overhead to doing it with a general approach, still not as good as bitwise solution though. – Sopel Mar 04 '17 at 14:40
  • That’s fun, my laptop consistently reports that multiplying by 2 is faster than binsearching, even your assembly optimized version: `Pow and log took 1527 milliseconds. Multiplying by 2 took 287 milliseconds. Binsearching precomputed table of powers took 327 milliseconds. Binsearching (opti) precomputed table of powers took 337 milliseconds.` Why does my laptop differ in that matter from coliru? –  Mar 06 '17 at 12:02
  • @gaazkam Interesting. Optimization is often arch-dependent. What compiler? What OS? What arch? – YSC Mar 06 '17 at 12:04
  • `g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609`; `Linux Mint 18 Sarah Cinnamon 3.0.7 (Gtk 3.18.9-1ubuntu3.1)`; `Kernel: 4.4.0-21-generic x86_64 (64 bit gcc: 5.3.1)`; `CPU: Dual core Intel Core i5-5200U (-HT-MCP-) cache: 3072 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 8787 clock speeds: max: 2700 MHz 1: 2362 MHz 2: 2458 MHz 3: 2294 MHz 4: 2225 MHz`; compilation flags: `g++ -std=c++14 -O2` –  Mar 06 '17 at 12:07
  • @gaazkam Amazing! Really, I don't know why, I've got similar setup, completely different results. – YSC Mar 06 '17 at 12:09

Climbs faster, but falls back at the same speed.

unsigned multiply_quick(unsigned n)
{
    if (n < 2u) return 1u;
    unsigned maxpow = 1u;

    if (n > 256u)
    {
        maxpow = 256u * 128u;

        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;
        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    else
    {
        // quicker scan by repeated squaring; must start at 2,
        // since 1 * 1 == 1 would loop forever
        maxpow = 2u;
        while (maxpow < n && maxpow != 256u)
            maxpow *= maxpow;

        // fast fixing the overshoot
        while (maxpow > n)
            maxpow = maxpow >> 2;

        // fixing the undershoot
        while (2u * maxpow <= n)
            maxpow *= 2u;
    }
    return maxpow;
}

Maybe this is better suited to 32-bit variables, using a 65k constant literal instead of 256.

huseyin tugrul buyukisik

Just set all bits to 0 except the first (most significant) one. This should be very fast and efficient.

Jack
  • It's not necessarily possible to do this as fast in C++ as in some assembly languages, but there may be useful intrinsic functions provided with your C++ compiler. For example, GCC provides `int __builtin_clz (unsigned int x)` to count the leading 0s, which can then be used to bit-shift right. That's likely to take only a couple of clock cycles whenever there is indeed CPU support. – Tony Delroy Mar 04 '17 at 11:39
  • How would you do that? JustRufus' implementation of this idea is no better that gcc's `-O2`. – YSC Mar 04 '17 at 11:44
  • This is about as useful as telling the OP "Just return the correct result". – CodesInChaos Mar 04 '17 at 14:20

As @Jack already mentioned, you can simply set all bits to 0 except the first one. Here is a solution:

#include <iostream>

uint16_t bit_solution(uint16_t num)
{
    if ( num == 0 )
        return 0;

    uint16_t ret = 1;
    while (num >>= 1)
        ret <<= 1;

    return ret;
}

int main()
{
    std::cout << bit_solution(1024) << std::endl; //1024
    std::cout << bit_solution(1025) << std::endl; //1024
    std::cout << bit_solution(1023) << std::endl; //512
    std::cout << bit_solution(1) << std::endl; //1
    std::cout << bit_solution(0) << std::endl; //0
}
JustRufus
  • I've [benchmarked it](http://coliru.stacked-crooked.com/a/d17589f0b8011cba), it does not seem to help. – YSC Mar 04 '17 at 11:42
  • Will benchmark this in a second. I was thinking that `-O2` would make this solution pretty much identical to my calculating subsequent powers, ofc I might’ve been very wrong here. –  Mar 04 '17 at 11:43
  • @YSC interesting. I guess multiply_only_once and my solution are the same solution except 0 checking. Also check [this solution](http://coliru.stacked-crooked.com/a/882c0665adf3ccd3) – JustRufus Mar 04 '17 at 12:11
  • Nice! You should [edit] your answer or post another one. (oops, too bad harold did it already). – YSC Mar 04 '17 at 12:24
  • Fun. The site you link to claims: "It would be faster by 2 operations to use the formula and the log base 2 method that uses a lookup table" Well, your benchmark seems to suggest something different. –  Mar 04 '17 at 12:35

Well, it's still a loop (and its loop count depends on the number of set bits since they are reset one by one), so its worst case is likely to be worse than the approaches using block bit manipulations.

But it's cute.

uint_fast16_t bitunsetter(uint_fast16_t n)
{
  while (uint_fast16_t k = n & (n-1))
    n = k;
  return n;
}