14

I am interested, which is the optimal way of calculating the number of bits set in byte by this way

template< unsigned char byte > class BITS_SET
{
public:
    enum {
     B0 = (byte & 0x01) ? 1:0,
     B1 = (byte & 0x02) ? 1:0,
     B2 = (byte & 0x04) ? 1:0,
     B3 = (byte & 0x08) ? 1:0,
     B4 = (byte & 0x10) ? 1:0,
     B5 = (byte & 0x20) ? 1:0,
     B6 = (byte & 0x40) ? 1:0,
     B7 = (byte & 0x80) ? 1:0
    };
public:
 enum{RESULT = B0+B1+B2+B3+B4+B5+B6+B7};
};

Maybe it is optimal when value of byte is known at run-time? Is it recommended use this in code?

Donald Duck
  • 8,409
  • 22
  • 75
  • 99
  • This is called a population count and it can be done much more efficiently than testing one bit at a time. On x86 it can be done with a single instruction. On other architectures it can be done with a few instructions. – Paul R Mar 30 '12 at 20:28
  • 1
    @PaulR, the template solution proposed would calculate at compile time, which would take less than one instruction at run-time! – Mark Ransom Mar 30 '12 at 20:32
  • Ah - sorry - missed the point of the question ! – Paul R Mar 30 '12 at 20:34
  • @PaulR, the question is a bit confusing - it shows code that would only work at compile time but asks if it would be optimal at run time. I assume that English is not the asker's native language. – Mark Ransom Mar 30 '12 at 20:41
  • yes sure ,it is not my native language,i am from georgia,but i didn't understand why you mentioned it? –  Mar 30 '12 at 20:47
  • https://en.cppreference.com/w/cpp/numeric/popcount – Jesper Juhl Apr 20 '23 at 13:14

13 Answers13

30

For one byte of data, the optimal way considering both speed and memory consumption:

uint8_t count_ones (uint8_t byte)
{
  static const uint8_t NIBBLE_LOOKUP [16] =
  {
    0, 1, 1, 2, 1, 2, 2, 3, 
    1, 2, 2, 3, 2, 3, 3, 4
  };


  return NIBBLE_LOOKUP[byte & 0x0F] + NIBBLE_LOOKUP[byte >> 4];
}

Calling this function from a for loop should yield quite an efficient program on most systems. And it is very generic.

Lundin
  • 195,001
  • 40
  • 254
  • 396
  • This takes twice as long as a plain lookup of the byte in a 256 entry table. – Ira Baxter Jun 22 '15 at 08:25
  • 8
    @IraBaxter Indeed. And it takes 16 times less memory. So this is for example a common solution in embedded systems. – Lundin Jun 22 '15 at 10:56
  • I though `const` where `static` by default. Aren't they? – Rui Monteiro Apr 08 '22 at 07:38
  • @RuiMonteiro Not necessarily. Local scope `const` may in theory as well get allocated on the stack as in read-only memory. I suppose that some bad embedded systems compiler might get the idea to copy down the lookup table from flash to RAM before executing this. And there's also the Harvard system aspect - on Harvard systems you might not even be able to read the table straight from flash. In which case you might have to add some non-standard keyword for the code to work. – Lundin Apr 08 '22 at 08:57
19

For 8-bit values, just use a 256-element lookup table.

For larger sized inputs, it's slightly less trivial. Sean Eron Anderson has several different functions for this on his Bit Twiddling Hacks page, all with different performance characteristics. There is not one be-all-end-all-fastest version, since it depends on the nature of your processor (pipeline depth, branch predictor, cache size, etc.) and the data you're using.

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
  • Don't use a 256-element lookup table. It's not worth it. Use a 16-element one and just look up each nibble. – user541686 Feb 15 '21 at 01:43
13

Why not just use the standard library? That way the optimal way should be determined by the implementation, and is likely better than any standards compliant code that you can actually write. For instance, if you're on an x86 this compiles to a single instruction but only if you're targeting CPUs that support it.

#include <bitset>
#include <iostream>

int main() {
  unsigned char bitfield = 17;
  std::cout << std::bitset<8>(bitfield).count() <<
    std::endl;
}
OmnipotentEntity
  • 16,531
  • 6
  • 62
  • 96
  • FYI: with VS2017, on an i5 with SSE4 support, in release, this still doesn't compile to a POPCNT, but instead uses a Good 'Ol LUT...sadly – SonarJetLens Aug 04 '17 at 19:46
  • Or just use [std::popcount()](https://en.cppreference.com/w/cpp/numeric/popcount). – Jesper Juhl Apr 20 '23 at 14:49
  • `std::popcount` is indeed a wonderful function. Thank you for pointing it out. However, when I initially wrote this answer back in 2015 it was not yet part of the C++ standard. It wasn't standardized until C++20. – OmnipotentEntity Apr 20 '23 at 23:25
4

For just a single byte value, the fastest way is to store the answer in an 256 byte array that you index with the value. For example, bits_set[] = {0, 1, 1, 2, ...

TJD
  • 11,800
  • 1
  • 26
  • 34
  • 3
    With the speed of today's processors and the comparative slowness of memory, I'd have to see a shootout between a lookup table and some bit-twiddling code before I'd declare the lookup to be the "fastest". – Mark Ransom Mar 30 '12 at 20:31
  • 1
    If you don't feel like making a 256 element array, you could instead make a 16 element array, and bitshift the second half. – Matthew Mar 30 '12 at 20:39
  • @MarkRansom, that's possible. Actually some CPUs have an assembly instruction that counts bits set in a single cycle. But also, I spend a lot of time on embedded systems that have single cycle access to all their (small amount of) memory. – TJD Mar 30 '12 at 20:39
  • Don't be so sure the "bitcount" machine instruction operates in a single clock cycle. Whether it does depends on the specific microarchitecture, and more likely on whether the chip is used by the NSA or not :-} – Ira Baxter Mar 30 '12 at 20:42
  • @MarkRansom: But a small lookup table would very likely fit within a CPU's L1 cache. – Nathan Osman Mar 24 '13 at 22:03
  • @Matthew: Once you are only using 16 bytes, you might even be able to make use of some of the processor registers for this purpose. – Nathan Osman Mar 24 '13 at 22:04
  • @GeorgeEdison, yes it will fit, but what will it push out? – Mark Ransom Mar 24 '13 at 22:14
3

Why not do a left shift and mask off the rest?

int countBits(unsigned char byte){
    int count = 0;
    for(int i = 0; i < 8; i++)
        count += (byte >> i) & 0x01; // Shift bit[i] to the first position, and mask off the remaining bits.
    return count;
}

This can easily be adapted to handle ints of any size by simply calculating how many bits there is in the value being counted, then use that value in the counter loop. This is all very trivial to do.

int countBits(unsigned long long int a){
    int count = 0;
    for(int i = 0; i < sizeof(a)*8; i++)
        count += (a >> i) & 0x01;
    return count;
}
Wkter
  • 152
  • 3
2

C++20 introduced std::popcount from the header <bit>

std::popcount(0b1101u) will return 3

See https://en.cppreference.com/w/cpp/numeric/popcount for more details.

Clèm
  • 424
  • 3
  • 14
2

The usual answer for "fastest way to do bitcount" is "look up the byte in an array". That kind of works for bytes, but you pay an actual memory access for it. If you only do this once in awhile, it is likely the fastest, but then you don't need the fastest if you only do it once in awhile.

If you do it a lot, you are better off batching up bytes into words or doublewords, and doing fast bitcount operations on these. These tend to be pure arithmetic, since you can't realistically lookup a 32 bit value in an array to get its bitcount. Instead you combine values by shifting and masking in clever ways.

A great source of clever tricks for doing this is Bit Hacks.

Here is the scheme published there for counting bits in 32 bit words in C:

 unsigned int v; // count bits set in this (32-bit value)
 unsigned int c; // store the total here

 v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
 v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • so it means that,my code is not protable yes?generally my question refers which one is better if compiler supports template programming? –  Mar 30 '12 at 20:49
  • Your target computer has a native word size easily known to your application at compile time. If you predicate your code on number of bits processed, you will have at most 4 routines in the forseeable future: one for 8, 16, 32, 64 bits (you decide if you think 128 bits will happen soon). I think that's pretty portable, and it will be fast on whichever target architecture you've compiled for. [It isn't clear to me that you want to use template metaprogramming for this]. – Ira Baxter Mar 30 '12 at 21:21
  • this is for 32 bit numbers... for 8 bit is simpler (check my answer) – Zibri Sep 02 '19 at 15:39
1
#include <iostream>
#include <climits> // for CHAR_BIT (most likely to be 8)
#include <cstring> // for memset
#include <new> 

static const int DUMMY = -1;

// first approch : activate the O(8) function in first get try... after that its O(1);
class bitsInByteflyLUT
{
    typedef unsigned char byte;

    public:
        bitsInByteflyLUT();     //CTOR - throws std::bad_alloc
        ~bitsInByteflyLUT();    //DTOR


        int Get_bitsInByte(byte _byte);     


    private:
        // CLASS DATA
        int*    flyLUT;

        // PRIVATE FUNCTIONS
        int bitsInByte(byte _byte);
        // O(8) for finding how many bits are ON in a byte.
        // answer can be between 0 to CHAR_BIT.

        bitsInByteflyLUT(const bitsInByteflyLUT & _class); // COPY CTOR - forbidden
        const bitsInByteflyLUT & operator= (const bitsInByteflyLUT& _class);
        // ASSIGN OPERATOR - forbidden

};

bitsInByteflyLUT::bitsInByteflyLUT()
{
    size_t nIndexes = 1 << CHAR_BIT;
    try
    {
        flyLUT =  new int[nIndexes];
    }
    catch (std::bad_alloc& ba)
    {
        throw;
    }
    memset(flyLUT, DUMMY, sizeof(int)*nIndexes);
}


bitsInByteflyLUT::~bitsInByteflyLUT()
{
    delete[] flyLUT;
}


int bitsInByteflyLUT::Get_bitsInByte(byte _byte)
{
    if (flyLUT[_byte] == DUMMY) // if its first time we try to get answer for this char.
    {
        flyLUT[_byte] = bitsInByte(_byte); // O(8)
    }
    return flyLUT[_byte]; // O(1) 
}

int bitsInByteflyLUT::bitsInByte(byte _byte)
{   
    byte nBits = CHAR_BIT;
    byte counter = 0;
    byte mask = 1;
    while(nBits--)
    {
        if(mask & _byte)
        {
            ++counter;
        }
        mask <<= 1;
    }
    return counter;
}





int main ()
{
    using std::cout;
    using std::endl;

    bitsInByteflyLUT flut;

    for (unsigned int i = 0; i < (1 << CHAR_BIT); i += 1)
    {   
        cout << i << " " << flut.Get_bitsInByte(i) << endl;
    }

    return 0;
}
wewewe
  • 11
  • 1
1

Using C++17 you can precalculate the lookup table using a constexpr lambda. Easier to reason about the correctness of it rather than a ready copy-pasted table.

#include <array>
#include <cstdint>

static constexpr auto bitsPerByteTable = [] {
  std::array<uint8_t, 256> table{};
  for (decltype(table)::size_type i = 0; i < table.size(); i++) {
    table.at(i) = table.at(i / 2) + (i & 1);
  }
  return table;
}();
Zitrax
  • 19,036
  • 20
  • 88
  • 110
  • You can also run table generation off line to produce the table and then simply use it. As a practical matter, this is reasonable only for 8 bit values. If you want to do this for larger values, using multiple 8 bit lookups isn't going to be the fastest. See my answer. – Ira Baxter Aug 09 '21 at 17:04
1

In gcc you can use __builtin_popcount(unsigned) function.
It should efficiently use the optimal solution for the target hardware platform.
With -march=core-avx2 (highest level compatible with my cpu) the popcntl x86_64 assembly instruction was used, doing it in the hardware.
With the default x86_64 instruction set a popcntl function was called that implements the optimal C (clever hacks) algorithm.
There's also __builtin_popcountl and __builtin_popcountll for unsigned long and unsigned long long.

Marcelo Pacheco
  • 152
  • 1
  • 5
0

For C++ 14+: (https://onlinegdb.com/q_PR4Iocb)

#include <stdint.h>
#include <iostream>
#include <sstream>

using namespace std;

template <class T, int width = sizeof(T) << 3>
struct stBitPattern
{
    static inline constexpr T res(T x)
    {
        constexpr int hw = width >> 1;

        x = stBitPattern<T, hw>::res(x);
        x += x << hw;

        return x;
    }
};

template <class T>
struct stBitPattern<T, 8>
{
    static inline constexpr T res(T x)
    {
        return x;
    }
};

class csPop
{
    private:
        template <class T>
        static inline constexpr T _pop8(T x)
        {
            constexpr T m1 = stBitPattern<T>::res(0x55); //binary: 0101...
            constexpr T m2 = stBitPattern<T>::res(0x33); //binary: 00110011..
            constexpr T m4 = stBitPattern<T>::res(0x0f); //binary:  4 zeros,  4 ones ...

            x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
            x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
            x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits
            return x;
        }

        template <class T, int width = sizeof(T) << 3>
        struct _pop
        {
            static inline constexpr T res(T x)
            {
                const int hw = width >> 1;
    
                x = _pop<T, hw>::res(x);
                x += x >> hw;
    
                return x;
            }
        };

        template <class T>
        struct _pop<T, 8>
        {
            static inline constexpr T res(T x)
            {
                x = _pop8(x);
                return x;
            }
        };

        template<class T>
        static inline constexpr bool _chkIfTAvail(T *p, T *pend);

    public:
        template <class T>
        static inline constexpr int pop(T x)
        {
            x = _pop<T>::res(x);
            return (int)(x & 0x7f);
        }

        template <class T>
        static int pop(T *p, size_t sz);
};

template<class T>
constexpr bool csPop::_chkIfTAvail(T *p, T *pend)
{
    return ((uint8_t*)pend-(uint8_t*)p) >= sizeof(T);
}

template <class T>
int csPop::pop(T *p, size_t sz)
{
    int res = 0;
    T *pend = (T*)((uint8_t*)p + sz);
    
    if (sz >= sizeof(T))
    {
        int t = (size_t)p & (sizeof(T) - 1);
        
        sz -= t;

        if (t >= sizeof(uint8_t))
        {
            res += pop(*(uint8_t*)p);
            p = (T*)((uint8_t*)p + 1);
            t -= sizeof(uint8_t);
        }       
        if (t >= sizeof(uint16_t))
        {
            res += pop(*(uint16_t*)p);
            p = (T*)((uint16_t*)p + 1);
            t -= sizeof(uint16_t);
        }       
        if (t >= sizeof(uint32_t))
        {
            res += pop(*(uint32_t*)p);
            p = (T*)((uint32_t*)p + 1);
            t -= sizeof(uint32_t);
        }
        if (t >= sizeof(uint64_t))
        {
            res += pop(*(uint64_t*)p);
            p = (T*)((uint64_t*)p + 1);
            t -= sizeof(uint64_t);
        }

        if (_chkIfTAvail(p, pend))
        {
            sz -= sizeof(T);
            res += pop(*p++);

            for(; _chkIfTAvail(p, pend); p++, sz-= sizeof(T))
            {
                res += pop(*p);
            }       
        }
    } // if (sz >= sizeof(T))

    if (sz >= sizeof(uint64_t))
    {
        res += pop(*(uint64_t*)p);
        p = (T*)((uint64_t*)p + 1);
        sz -= sizeof(uint64_t);
    }
    if (sz >= sizeof(uint32_t))
    {
        res += pop(*(uint32_t*)p);
        p = (T*)((uint32_t*)p + 1);
        sz -= sizeof(uint32_t);
    }
    if (sz >= sizeof(uint16_t))
    {
        res += pop(*(uint16_t*)p);
        p = (T*)((uint16_t*)p + 1);
        sz -= sizeof(uint16_t);
    }
    if (sz >= sizeof(uint8_t))
    {
        res += pop(*(uint8_t*)p);
        p = (T*)((uint8_t*)p + 1);
        sz -= sizeof(uint8_t);
    }

    return res;
}

int main()
{
    uint32_t x = 16777215;
    cout << "pop(" << x << ") = " << csPop::pop(x) << endl;
    
    uint32_t a[] = {x, x, x, x, 15};
    uint32_t *p_end = a + sizeof(a)/sizeof(a[0]) - 1;
    uint32_t *p = a;
    stringstream ss;
    
    for(; p_end - p; p++)
    {
        ss << to_string(*p) << ", ";
    }
    ss << to_string(*p);
    
    cout << "pop({" << ss.str() << "}) = " << csPop::pop(a, sizeof(a)) << endl;

    return 0;
}

Result:

pop(16777215) = 24
pop({16777215, 16777215, 16777215, 16777215, 15}) = 100
zenbooster
  • 29
  • 5
0
int count(int a){ return a == 0 ? 0 : 1 + count(a&(a-1)); }
Jarosław Gomułka
  • 4,935
  • 18
  • 24
  • would not recursively calling function increases speed of function exponential?maybe it is better using while loop? –  Mar 30 '12 at 20:32
  • Modern compilers are optimizing such simple functions to "non-recursive" versions. – Jarosław Gomułka Mar 30 '12 at 20:34
  • This still takes N steps if there are N bits in the word. If N=32 this can take 32 steps with a conditional branch in each one. Bad news. Better to use a logarithmic summing; for N~~32 you get constant time. The BitHacks code I quoted takes 14 machine instructions, many of them single clocks on modern CPUs. – Ira Baxter Mar 31 '12 at 01:18
  • This takes 2 * number_of_set_bits + 1. So for 10000010 it will need 5 instructions. But in general, solution from BitHacks is definitly better. – Jarosław Gomułka Mar 31 '12 at 06:39
-2
#include <ctime>
#include <iostream>
using namespace std;

int count1s(unsigned char byte) {
  if (byte == 0) {
    return 0;
  }

  if (byte & 0x01) {
    return 1 + count1s(byte >> 1);
  }
  return count1s(byte >> 1);
}

int count1s2(unsigned char byte) {
  static const int ones[256] = {
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4,
      2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4,
      2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
      4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5,
      3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
      4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};

  return ones[(int)byte];
}

int main() {
  time_t start = clock();
  int c = count1s(205);
  time_t end = clock();
  cout << "count1: " << c << " time: " << double(end - start) << endl;
  start = clock();
  c = count1s2(205);
  end = clock();
  cout << "count2: " << c << " time: " << double(end - start) << endl;
  return 0;
}
Zitrax
  • 19,036
  • 20
  • 88
  • 110
zed
  • 15
  • 1
    Please correctly format your answer, as well as give an explanation. Simply slapping code in a reponse doesn't qualify as a good answer. – m_callens Jul 07 '16 at 19:51
  • Thanks for giving us the 256-element lookup table. I just hope it's right. – Rodrigo Dec 01 '18 at 01:37