How to check the number of set bits in an 8-bit unsigned char?

Question

So I have to find the set bits (on 1) of an unsigned char variable in C?

A similar question is How to count the number of set bits in a 32-bit integer? But it uses an algorithm that's not easily adaptable to 8-bit unsigned chars (or its not apparent).

The cited duplicate and Community Wiki answer uses an `int` or 32-bit wide variable. The use of a signed int in the cited dup produces undefined results in C/C++ due to use of `operator >>`. Here is one of the comments on that code: *".... you are not meant to understand or maintain this code, just worship the gods that revealed it to mankind."* Its not clear to me how Flav is supposed to use the answer in the cited dup to answer ***his*** question, given its not readily understandable or maintainable. — jww, Jun 06 '15 at 23:49
I reopened this question because the approach that can be taken with 8-bit variables is not possible to use for 32-bit variables discussed in the duplicate. — Sergey Kalinichenko, Jun 07 '15 at 00:03
@jww seems to me that that answer should be edited to use `uint32_t` .. looks like nobody has been brave enough to do so! Or at least, clarify the preceding sentence . — M.M, Jun 07 '15 at 01:07
@Matt - check out the edit history. I think it went from `signed int` to `uint32_t` to `int`. I avoided the edit precisely because of the edit history. (I'm abrasive at times, but its not my intent to start a battle on this one). — jww, Jun 07 '15 at 01:31
@Flav - you have the embedded tag. What is the processor? Some processors have an ASM instruction to do it. i386 and x84_64 lack it (IIRC), but ARM has it with `VCNT`. — jww, Jun 07 '15 at 02:18
@jww: VCNT is an ARM-NEON instruction. Wish the Cortex-M would have it, but it actually does _not_. As much as ARM7 for instance. — too honest for this site, Jun 07 '15 at 02:24
@jww: Could you please point me where I actually _do_ state that? All I say is that not all ARM CPUs do have VCNT. From your comment one has to get the impression that VCNT is actually provided by all ARM CPUs, which is simply not true. Get the logic right: ARM-NEON is a subset of ARM, but **not** ARM as a whole. And: yes. **Most** ARM CPUs in the world do not have NEON. Have a look at the market shares for Cortex-M0(+)/3/4 and ARM7/ARM9. — too honest for this site, Jun 07 '15 at 02:41
@Olaf - I think you are detracting from the question and the potential answers. At minimum, you are polluting comments. Let the fellow answer the question on what processor he has. Stop with the vanity comments, about how you think the world should work or what the implementation should favor (memory versus time trade-offs). You did it here, and you did it on at least two answers below. They call it [bike shedding](http://bikeshed.com/). Just because you have an opinion does not mean you should share it with us. After 2 or 3 of the badgerings, folks like me lose patience with folks like you. — jww, Jun 07 '15 at 02:50
@jww: Yeah, I prefer logical conclusions; no assumptions. One should not confuse Rep here with rep in real live. Argumentum ad verecundiam? How poor! — too honest for this site, Jun 07 '15 at 02:55
*"One should not confuse Rep here with rep in real live..."* - absolutely. No arguments from me. *"I prefer logical conclusions; not assumptions"* - like your assumption about processors; or your assumption about goals, like memory versus time trade-offs. Or your comments on what would constitute the best inline assembly code. One of you best in this question is I was uninformed because I provided alternatives. I chose not to add a ["mee too"](https://meta.stackexchange.com/questions/52764/what-is-a-protected-question) answer, and simply offered some alternatives without repeating answers. — jww, Jun 07 '15 at 03:06
@jww: Ok, I'll try to setle this and hope you will also: You actually stated here and in your answer that "ARM has a VCNT" instruction. All I did now, was to make clear that **not** all ARM CPUs actually do. Among those are the mostly used embedded ARM Cortex-M as much as the older ARM7 CPUs. So, one **cannot assume** `ARM == VCNT avaliable` as _your_ statement implies. That was all! For code generation: See my statement about reps IRL. — too honest for this site, Jun 07 '15 at 03:06
@jww: You are very welcome just to ignore my posts. Me for myself have no need for such verbicide. — too honest for this site, Jun 07 '15 at 03:09
@Olaf - *"You are very welcome just to ignore my posts."* - actually, we don't have a way to add folks like you to a killfile until your behavior improves. See, for example, [Does Stack Overflow or Meta have a Killfile equivalent?](http://meta.stackoverflow.com/q/290851). — jww, Jun 07 '15 at 03:15

score 7 · Answer 1 · edited May 23 '17 at 11:48

The algorithm suggested in the question How to count the number of set bits in a 32-bit integer? is trivially adapted to 8 bit:

int NumberOfSetBits( uint8_t b )
{
     b = b - ((b >> 1) & 0x55);
     b = (b & 0x33) + ((b >> 2) & 0x33);
     return (((b + (b >> 4)) & 0x0F) * 0x01);
}

It is simply a case of shortening the constants the the least significant eight bits, and removing the final 24 bit right-shift. Equally it could be adapted for 16bit using an 8 bit shift. Note that in the case for 8 bit, the mechanical adaptation of the 32 bit algorithm results in a redundant * 0x01 which could be omitted.

score 4 · Answer 2 · edited Jun 07 '15 at 02:19

4

The fastest approach for an 8-bit variable is using a lookup table.

Build an array of 256 values, one per 8-bit combination. Each value should contain the count of bits in its corresponding index:

int bit_count[] = {
// 00 01 02 03 04 05 06 07 08 09 0a, ... FE FF
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, ..., 7, 8
};

Getting a count of a combination is the same as looking up a value from the bit_count array. The advantage of this approach is that it is very fast.

You can generate the array using a simple program that counts bits one by one in a slow way:

for (int i = 0 ; i != 256 ; i++) {
    int count = 0;
    for (int p = 0 ; p != 8 ; p++) {
        if (i & (1 << p)) {
            count++;
        }
    }
    printf("%d, ", count);
}

(demo that generates the table).

If you would like to trade some CPU cycles for memory, you can use a 16-byte lookup table for two 4-bit lookups:

static const char split_lookup[] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

int bit_count(unsigned char n) {
    return split_lookup[n&0xF] + split_lookup[n>>4];
}

Demo.

edited Jun 07 '15 at 02:19

jww

97,681
90
411
885

answered Jun 06 '15 at 23:59

Sergey Kalinichenko

714,442
84
1,110
1,523

1

That is fast, but apparently very memory-consuming. Not the best idea for an (potential) 8 bit MCU. For a harvard-MCU, this can easily be even slower. – too honest for this site Jun 07 '15 at 00:28
3

@Olaf 256 bytes (which can be easily shrunk to 128 bytes by packing counts into 4-bit nibbles) could hardly be classified as "memory consuming" except in the tiniest of the embedded systems (PICs come to mind). Even on 8-bit systems from long time back (68HC11 with 16..32K of ROM) 256 bytes was not too much. As far as the speed goes, this would be hard to beat, too: on HC11 this would take two four-cycle instructions (`LDX` and `LDAA` with `X` as an index). – Sergey Kalinichenko Jun 07 '15 at 00:38
You might be surprised. MSP430F2012 comes to mind: 16 bit CPU with 2KiB Flash. and still, the problem with Harvard-Flash is valid. Accesses to Flash might be pretty costly. Note also, that for am STM32F4x MCU, Flash adds multiple waitstates and can interrupt instruction prefetch. All issues you should not forget about when recommending your solution as the only right one. On an 8 bit MCU, the additional instrucions also add up (8/16 bit MCUs often have no barrel shifter). – too honest for this site Jun 07 '15 at 00:40
@Olaf I don't know, I have been out of embedded development since 1997, so I missed much of what happened in nearly two decades. – Sergey Kalinichenko Jun 07 '15 at 00:58
Funny: around that time, I started this as a job (but have done that semi-pro about one decade longer). FYI: as I wrote, there are really still tons of MCUs with few bytes of RAM and litte Flash around. And that is no legacy, gut quite modern devices. – too honest for this site Jun 07 '15 at 01:03
@dasblinkenlight Reading flash in 1997 might be slower than doing this calculation in registers! – M.M Jun 07 '15 at 01:10
@MattMcNabb: Actually not, as MCUs that time mostly had ROM; Flash just started at that time. Microchip e.g. had no good Flash process until some years later (the PIC16F84 used a very costly implementation). Even with Flash, the Speed of ROM/Flash/whatever was the same as for the MCUs, so no wait states. For modern ARM-Cortex-M this has massively changed. I am sometimes very disappointed how slow these MCUs actually perform running out of Flash - exspecially with data tables in Flash. – too honest for this site Jun 07 '15 at 01:17
@MattMcNabb In 1997 flash was still a hot new thing, with price tag too high for things that we designed. We used static memory chips with battery backup for persistent data. – Sergey Kalinichenko Jun 07 '15 at 01:19
@dasblinkenlight: Not EPROM? IIRC, OTP (EPROM without glass-window actually) was more or less standard these days. – too honest for this site Jun 07 '15 at 01:24
@Olaf They put EPROMs in production models, while we always used chips w/glass window for development, testing, and to send out patches to clients (via FedEx :-) – Sergey Kalinichenko Jun 07 '15 at 01:27
Good call, dasblinkenlight. I thought table driven, too. the table takes 255 bytes of data. Is there an algorithm like that shown in the question of the previously cited dup? The algorithm would swap memory for time (assuming, say 6 or 8 instructions or 20 to 40 bytes of instruction). – jww Jun 07 '15 at 01:34
1

@jww The inner loop in the code that generates the table or the code in Olaf's answer do counting one bit at a time. Note that you could also pack data into 4-bit nibbles for easy retrieval, or use a split look-up. I'll edit with a split look-up solution and a demo. – Sergey Kalinichenko Jun 07 '15 at 01:39

score 2 · Answer 3 · answered Jun 07 '15 at 05:46

2

I think you are looking for Hamming Weight algorithm for 8bits? If it is true, here is the code:

unsigned char in = 22; //This is your input number
unsigned char out = 0;
in = in - ((in>>1) & 0x55);
in = (in & 0x33) + ((in>>2) & 0x33);
out = ((in + (in>>4) & 0x0F) * 0x01) ;

answered Jun 07 '15 at 05:46

Ngo Thanh Nhan

503
2
5
17

What is the purpose of multiplying by `0x01` in `out = ((in + (in>>4) & 0x0F) * 0x01);`? – jww Jun 07 '15 at 11:35
That is the last step of Hamming Weight algorithm. http://en.wikipedia.org/wiki/Hamming_weight . If the input is 16bits, it should be 0x0101 – Ngo Thanh Nhan Jun 07 '15 at 15:31
Multiplication by 1 does not make much sense. For wider input, it does actually. – too honest for this site Jun 07 '15 at 16:29

score 1 · Answer 4 · edited May 23 '17 at 12:19

Counting the number of digits different than 0 is also known as a Hamming Weight. In this case, you are counting the number of 1's.

Dasblinkenlight provided you with a table driven implementation, and Olaf provided you with a software based solution. I think you have two other potential solutions. The first is to use a compiler extension, the second is to use an ASM specific instruction with inline assembly from C.

For the first alternative, see GCC's __builtin_popcount(). (Thanks to Artless Noise).

For the second alternative, you did not specify the embedded processor, but I'm going to offer this in case its ARM based.

Some ARM processors have the VCNT instruction, which performs the count for you. So you could do it from C with inline assembly:

inline
unsigned int hamming_weight(unsigned char value) {
    __asm__ __volatile__ (
            "VCNT.8"
            : "=value"
            : "value"
    );

    return value;
}

Also see Fastest way to count number of 1s in a register, ARM assembly.

For completeness, here is Kernighan's bit counting algorithm:

int count_bits(int n) {
    int count = 0;
    while(n != 0) {
        n &= (n-1);
        count++;
    }
    return count;
}

Also see Please explain the logic behind Kernighan's bit counting algorithm.

Explorer09 · Answer 5 · 2022-01-14T03:35:27.810

I made an optimized version. With a 32-bit processor, utilizing multiplication, bit shifting and masking can make smaller code for the same task, especially when the input domain is small (8-bit unsigned integer).

The following two code snippets are equivalent:

unsigned int bit_count_uint8(uint8_t x)
{
    uint32_t n;
    n = (uint32_t)(x * 0x08040201UL);
    n = (uint32_t)(((n >> 3) & 0x11111111UL) * 0x11111111UL);
    /* The "& 0x0F" will be optimized out but I add it for clarity. */
    return (n >> 28) & 0x0F;
}

/*
unsigned int bit_count_uint8_traditional(uint8_t x)
{
    x = x - ((x >> 1) & 0x55);
    x = (x & 0x33) + ((x >> 2) & 0x33);
    x = ((x + (x >> 4)) & 0x0F);
    return x;
}
*/

This produces smallest binary code for IA-32, x86-64 and AArch32 (without NEON instruction set) as far as I can find.
For x86-64, this doesn't use the fewest number of instructions, but the bit shifts and downcasting avoid the use of 64-bit instructions and therefore save a few bytes in the compiled binary.
Interestingly, in IA-32 and x86-64, a variant of the above algorithm using a modulo ((((uint32_t)(x * 0x08040201U) >> 3) & 0x11111111U) % 0x0F) actually generates larger code, due to a requirement to move the remainder register for return value (mov eax,edx) after the div instruction. (I tested all of these in Compiler Explorer)

Explanation

I denote the eight bits of the byte x, from MSB to LSB, as a, b, c, d, e, f, g and h.

                               abcdefgh
*   00001000 00000100 00000010 00000001 (make 4 copies of x
---------------------------------------  with appropriate
abc defgh0ab cdefgh0a bcdefgh0 abcdefgh  bit spacing)
>> 3                                   
---------------------------------------
    000defgh 0abcdefg h0abcdef gh0abcde
&   00010001 00010001 00010001 00010001
---------------------------------------
    000d000h 000c000g 000b000f 000a000e
*   00010001 00010001 00010001 00010001
---------------------------------------
    000d000h 000c000g 000b000f 000a000e
... 000h000c 000g000b 000f000a 000e
... 000c000g 000b000f 000a000e
... 000g000b 000f000a 000e
... 000b000f 000a000e
... 000f000a 000e
... 000a000e
... 000e
    ^^^^ (Bits 31-28 will contain the sum of the bits
          a, b, c, d, e, f, g and h. Extract these
          bits and we are done.)

score 0 · Answer 6 · answered Jun 07 '15 at 00:05

0

Maybe not the fastest, but straightforward:

int count = 0;

for (int i = 0; i < 8; ++i) {
    unsigned char c = 1 << i;
    if (yourVar & c) {
        //bit n°i is set
        //first bit is bit n°0
        count++;
    }
}

answered Jun 07 '15 at 00:05

ThreeStarProgrammer57

2,906
2
16
24

too honest for this site · Answer 7 · 2015-06-07T16:31:36.267

For 8/16 bit MCUs, a loop will very likely be faster than the parallel-addition approach, as these MCUs cannot shift by more than one bit per instruction, so:

size_t popcount(uint8_t val)
{
    size_t cnt = 0;
    do {
        cnt += val & 1U;    // or: if ( val & 1 ) cnt++;
    } while ( val >>= 1 ) ;
    return cnt;
}

For the incrementation of cnt, you might profile. If still too slow, an assember implementation might be worth a try using carry flag (if available). While I am in against using assembler optimizations in general, such algorithms are one of the few good exceptions (still just after the C version fails).

If you can omit the Flash, a lookup table as proposed by @dasblinkenlight is likey the fastest approach.

Just a hint: For some architectures (notably ARM and x86/64), gcc has a builtin: __builtin_popcount(), you also might want to try if available (although it takes int at least). This might use a single CPU instruction - you cannot get faster and more compact.

score 0 · Answer 8 · answered Aug 23 '21 at 00:43

Allow me to post a second answer. This one is the smallest possible for ARM processors with Advanced SIMD extension (NEON). It's even smaller than __builtin_popcount() (since __builtin_popcount() is optimized for unsigned int input, not uint8_t).

#ifdef __ARM_NEON
/* ARM C Language Extensions (ACLE) recommends us to check __ARM_NEON before
   including <arm_neon.h> */
#include <arm_neon.h>

unsigned int bit_count_uint8(uint8_t x)
{
    /* Set all lanes at once so that the compiler won't emit instruction to
       zero-initialize other lanes. */
    uint8x8_t v = vdup_n_u8(x);
    /* Count the number of set bits for each lane (8-bit) in the vector. */
    v = vcnt_u8(v);
    /* Get lane 0 and discard other lanes. */
    return vget_lane_u8(v, 0);
}
#endif

How to check the number of set bits in an 8-bit unsigned char?

8 Answers8

Explanation