There are algorithms that are better for more than 8 bits. @rcgldr's answer is a useful start to a 16 or 32-bit popcount. See How to count the number of set bits in a 32-bit integer? for some bithack and other algorithms, including table lookup.
You could consider a 4-bit lookup table. MSP430 shifts are slow-ish (1 cycle per bit, and 1 instruction per bit if you don't have MSP430X). Or use a big 8-bit lookup table.
Or loop over the set bits, clearing the low bit with v &= v - 1;
. In MSP430 that takes a MOV, DEC, and AND. That's great if only a couple bits are usually set, but they're often scattered.
But the simplest and smallest code-size way is to just loop over all the bits one at a time.
If you're going to loop one bit at a time to keep it simple and compact, you want to take advantage of the carry flag by shifting into carry and using ADDC (add-with-carry).
I tried to write C that compilers could turn into nice asm using ADDC, but https://godbolt.org/z/2Ev2IC is the best I managed. GCC and clang don't do very well for MSP430 with the tmp = a+a; carry = tmp<a;
idiom that they recognize for x86 and most other architectures.
So anyway, you wanted asm in the first place:
;; simple naive bit-count. Small code-size and not too slow for 8 bits
;; input in r12, result: r11 = popcount(r12)
mov.w #0, r11 ; retval = 0
.popcount_loop: ; do{
add.b r12,r12 ; shift a bit into carry flag
addc #0, r11 ; add that bit to r11: r11 += 0 + C
tst.b r12
jnz .popcount_loop ; } while( (uint8_t)r12 != 0);
Using byte operand-size for the add
means that bit 7 goes into C, not bit 15.
We could instead use a right-shift to put the low bit into the C flag, especially if we expect many inputs to be small numbers (so the non-zero bits are all toward the lower end). According to this copy of a MSP430 / MSP430X instruction-set reference google found, plain MSP430 doesn't have shift right, only rotate-right through carry. RRC[.W]
/ RRC.B
. MSP430X has some "rotates" that actually shift in zeros, so they're really shifts. But we don't need that if we make sure C=0 before we run it. Since the population count won't wrap, ADDC will reliably clear C for us.
We can optimize this to fewer instructions inside the loop (same code size but runs faster), by having both JNZ and ADDC consume flags from the same ADD. Since ADDC also writes flags, that means it has to be in the next iteration. So we have to skew the loop. We can peel the first iteration and do its ADD outside the loop. We won't check for zero afterwards, but that's fine. Running one extra iteration for input = 0x80
is not a correctness problem, and not worth spending extra instructions on.
; simple looping popcount, optimized for small numbers (right shift)
; and optimized for fewer instructions inside the loop
;; input in r12, result: r11 = popcount(r12)
xor.w r11, r11 ; r11=0, C=!Z=0. (mov doesn't set flags; this saves a CLRC)
rrc.b r12 ; C = lsb(r12); r12 >>= 1 ; prep for first iter
.popcount_loop: ; do{
addc #0, r11 ; result += C; Clears C because r11 won't wrap
rrc.b r12 ; C = lsb(r12); r12 >>= 1; Z = (r12==0)
jnz .popcount_loop ; } while( (uint8_t)r12 != 0);
addc #0, r11 ; we left the loop with the last bit still in C
If your input value is zero-extended, you could use rrc.w r12
so the loop works for 8 or 16-bit values. But it's no slower because it still exits after shifting all the bits out the right.
Skewing the loop and peeling the first half of the first iteration, and last half of the last iteration, only cost us one extra instruction total. (And they're still all single-word instructions.)
You mention odd/even. Do you actually just want the parity? (Whether the population count is odd or even)? That's the same thing as the horizontal XOR of all the bits.
; Needs MSP430X for rrum, otherwise you can only shift by 1 bit per instruction
;; input in r12, result: r12=parity(r12)
;; clobbers: r11
mov.b r12, r11 ; copy the low byte, zero the upper byte of R11 (not that it matters)
rrum #4, r11 ; costs 4 cycles for shift-count = 4
xor r11, r12 ; low 4 bits ^= (high 4 bits >> 4)
mov.b r12, r11
rrum #2, r11 ; costs 2 cycles for shift-count = 2
xor r11, r12 ; narrow again to 2 bits
mov.b r12, r11
rrum #1, r11 ; costs 1 cycle for shift-count = 1.
xor r11, r12 ; narrow again to 2 bits
and #1, r12 ; clear high garbage from the high bits.
; ret if this isn't inline
You could do this with a loop, e.g. use the popcount loop and do and #1, r12
at the end.
I feel like maybe we could save instructions if we shifted left (by 4 then 2) and did the last step (shift by 1) with an add.b r12,r12
, because signed overflow (the V flag) = carry_in XOR carry_out for the sign bit. With both inputs the same for add, the existing sign bit will always be 0+0=00 or 1+1=10, so the sign bit = carry_in to the sign bit.
So with a bit-pattern like r12.b = XY??????
, add.b r12,r12
sets V = X^Y
, the horizontal XOR of the top two bits of the input. Because Y
is the carry-in to the MSB, and X is the carry-out.
This would be good if you want to branch on it, but MSP430 doesn't seem to have a jXX
that branches on V
being set or not. It has JL
and JGE
which branch on (N XOR V)
(i.e. signed compare), but N
will be equal to the MSB, so N ^ V
is just C, after our left-shift V sets V = N ^ C
. I guess you'd have to get the flag word out of the flag register and shift/mask it! Or test that flag bit and JNZ.