Need to count number of 1's in binary with assembly

Question

i have task where i should count number of 1's in binary that I set which have odd number then i need to display this on 7 segment display.

On the code I wrote a comment where I should do this.

I am working with Texas Instruments msp430. I looked at the other solution but they made with C not with assembly and unfortunately cant figure out how to do it on assembly.

       bis.b #11111111b, &P1DIR
       bic.b #11111111b, &P1OUT

loop_1:
       ; do stuff with &P1OUT
       call #delay
       ...

delay

       mov #0, R5
       mov #0, R4

odd_even:
           ;Over here i need to count number of 1's in binary but cant figure out how to do it
           jnz try
           jz delay_over


      ...
           ret

What architecture is this? Please tag! – fuz Jan 15 '19 at 20:16 — fuz, Jan 15 '19 at 20:16

Peter Cordes · Answer 1 · 2019-01-16T09:56:23.687

There are algorithms that are better for more than 8 bits. @rcgldr's answer is a useful start to a 16 or 32-bit popcount. See How to count the number of set bits in a 32-bit integer? for some bithack and other algorithms, including table lookup.

You could consider a 4-bit lookup table. MSP430 shifts are slow-ish (1 cycle per bit, and 1 instruction per bit if you don't have MSP430X). Or use a big 8-bit lookup table.

Or loop over the set bits, clearing the low bit with v &= v - 1;. In MSP430 that takes a MOV, DEC, and AND. That's great if only a couple bits are usually set, but they're often scattered.

But the simplest and smallest code-size way is to just loop over all the bits one at a time.

If you're going to loop one bit at a time to keep it simple and compact, you want to take advantage of the carry flag by shifting into carry and using ADDC (add-with-carry).

I tried to write C that compilers could turn into nice asm using ADDC, but https://godbolt.org/z/2Ev2IC is the best I managed. GCC and clang don't do very well for MSP430 with the tmp = a+a; carry = tmp<a; idiom that they recognize for x86 and most other architectures.

So anyway, you wanted asm in the first place:

;; simple naive bit-count.  Small code-size and not too slow for 8 bits

;; input in r12,  result: r11 = popcount(r12)
mov.w     #0, r11        ; retval = 0
.popcount_loop:          ; do{
    add.b   r12,r12          ; shift a bit into carry flag
    addc    #0, r11          ; add that bit to r11:  r11 += 0 + C

    tst.b    r12
    jnz   .popcount_loop ; } while( (uint8_t)r12 != 0);

Using byte operand-size for the add means that bit 7 goes into C, not bit 15.

We could instead use a right-shift to put the low bit into the C flag, especially if we expect many inputs to be small numbers (so the non-zero bits are all toward the lower end). According to this copy of a MSP430 / MSP430X instruction-set reference google found, plain MSP430 doesn't have shift right, only rotate-right through carry. RRC[.W] / RRC.B. MSP430X has some "rotates" that actually shift in zeros, so they're really shifts. But we don't need that if we make sure C=0 before we run it. Since the population count won't wrap, ADDC will reliably clear C for us.

We can optimize this to fewer instructions inside the loop (same code size but runs faster), by having both JNZ and ADDC consume flags from the same ADD. Since ADDC also writes flags, that means it has to be in the next iteration. So we have to skew the loop. We can peel the first iteration and do its ADD outside the loop. We won't check for zero afterwards, but that's fine. Running one extra iteration for input = 0x80 is not a correctness problem, and not worth spending extra instructions on.

; simple looping popcount, optimized for small numbers (right shift)
; and optimized for fewer instructions inside the loop

;; input in r12,  result: r11 = popcount(r12)
xor.w     r11, r11        ; r11=0,  C=!Z=0.   (mov doesn't set flags; this saves a CLRC)

rrc.b     r12             ; C = lsb(r12);   r12 >>= 1  ; prep for first iter

.popcount_loop:            ; do{
    addc    #0, r11          ; result += C;  Clears C because r11 won't wrap
    rrc.b   r12              ; C = lsb(r12);   r12 >>= 1;  Z = (r12==0)
    jnz    .popcount_loop  ; } while( (uint8_t)r12 != 0);

    addc    #0, r11        ; we left the loop with the last bit still in C

If your input value is zero-extended, you could use rrc.w r12 so the loop works for 8 or 16-bit values. But it's no slower because it still exits after shifting all the bits out the right.

Skewing the loop and peeling the first half of the first iteration, and last half of the last iteration, only cost us one extra instruction total. (And they're still all single-word instructions.)

You mention odd/even. Do you actually just want the parity? (Whether the population count is odd or even)? That's the same thing as the horizontal XOR of all the bits.

; Needs MSP430X for rrum, otherwise you can only shift by 1 bit per instruction

;; input in r12,  result: r12=parity(r12)
;; clobbers: r11
mov.b   r12, r11       ; copy the low byte, zero the upper byte of R11 (not that it matters)
rrum     #4, r11       ; costs 4 cycles for shift-count = 4
xor     r11, r12       ; low 4 bits ^= (high 4 bits >> 4)

mov.b   r12, r11
rrum     #2, r11       ; costs 2 cycles for shift-count = 2
xor     r11, r12       ; narrow again to 2 bits

mov.b   r12, r11
rrum    #1,  r11       ; costs 1 cycle for shift-count = 1.  
xor     r11, r12       ; narrow again to 2 bits

and      #1, r12       ; clear high garbage from the high bits.

; ret  if this isn't inline

You could do this with a loop, e.g. use the popcount loop and do and #1, r12 at the end.

I feel like maybe we could save instructions if we shifted left (by 4 then 2) and did the last step (shift by 1) with an add.b r12,r12, because signed overflow (the V flag) = carry_in XOR carry_out for the sign bit. With both inputs the same for add, the existing sign bit will always be 0+0=00 or 1+1=10, so the sign bit = carry_in to the sign bit.

So with a bit-pattern like r12.b = XY??????, add.b r12,r12 sets V = X^Y, the horizontal XOR of the top two bits of the input. Because Y is the carry-in to the MSB, and X is the carry-out.

This would be good if you want to branch on it, but MSP430 doesn't seem to have a jXX that branches on V being set or not. It has JL and JGE which branch on (N XOR V) (i.e. signed compare), but N will be equal to the MSB, so N ^ V is just C, after our left-shift V sets V = N ^ C. I guess you'd have to get the flag word out of the flag register and shift/mask it! Or test that flag bit and JNZ.

score 0 · Answer 2 · answered Jan 15 '19 at 18:41

0

On most computers there is no hardware to do that in a couple of instructions.

What you have to do is a set of mask and shifts :

unsigned char to_count, nbr=0, mask=0x1, m;
for (int i=0; i<8; i++) {
    m = to_count&mask ; //1 if LSB=1, 0 otherwise
    nbr += m;
    to_count >>=1 ;
}

For a larger number of bits, you can have smarter strategies to reduce computation time statistically, but for 8 bits you will have no gain.

answered Jan 15 '19 at 18:41

Alain Merigot

10,667
3
18
31

1

table lookups are a tradeoff and you can do a 4 bit table look up twice with it only costing you 16 bytes of table. how much code do you end up with though as you still have to mask and shift and compute the table offset and do the read, the loop of 8 is probably still cheaper. – old_timer Jan 15 '19 at 18:54
"Most computers" do have it, if you mean desktops / laptops / phones. All x86 since Nehalem and (some) K10 have it. ARMv7 has it (for SIMD only `vcnt` / `cnt), and compilers think it's worth going integer->NEON->integer to use it for AArch64, but not for ARM32 (https://godbolt.org/z/DKjgie). But yes, MSP430 does *not* have it in hardware. If you count all microcontrollers in the world as "computers" in their own right, then sure probably more than half of the world's CPUs don't have it. :P And lots of x86 software probably doesn't take advantage, because it's not quite baseline yet. – Peter Cordes Jan 16 '19 at 06:50
If you're going to loop one bit at a time, you want to take advantage of the carry flag by shifting into carry and using ADC. (You've posted a few MIPS answers, so maybe you're used to an ISA that doesn't have flags?) This function compiles to pretty poor code with gcc and clang for MSP430, but I can't get them to use ADC the way they will for x86. https://godbolt.org/z/NQqPGf (using the `tmp = a+a; carry = tmp – Peter Cordes Jan 16 '19 at 07:31

score -1 · Answer 3 · answered Jan 15 '19 at 21:25

-1

This logic could be slightly shorter than looping:

unsigned char popcnt(unsigned char a)
{
    a = a - ((a >> 1) & 0x55);            // 2 bit fields 0 -> 2
    a = (a & 0x33) + ((a >> 2) & 0x33);   // 4 bit fields 0 -> 4
    a = (a & 0x0f) +  (a >> 4);           // a = bit count
    return a;
}

answered Jan 15 '19 at 21:25

rcgldr

27,407
3
36
61

MSP430 has limited shifting, but it turns out it can right-shift with an immediate count of 1..4 (https://www.win.tue.nl/~johanl/educ/RTcourse/MSP430%20-%20general.pdf), vs. only one bit per instruction with many other shifts. Unfortunately compilers are dumb (gcc and clang for MSP430) and don't do a good job, e.g. using an arithmetic right shift repeatedly. https://godbolt.org/z/NQqPGf. For compact code you'd obviously want to shift into carry and then ADC into an accumulator, but I can't get gcc to do that either. – Peter Cordes Jan 16 '19 at 07:27
@PeterCordes - My answer describes the logic. My assumption is that this would be written in assembly, but I don't know MSP430 instruction set, and I also made it clear that it could be shorter, but again I don't know MSP430. – rcgldr Jan 16 '19 at 19:23
I understood that. My point is that this logic probably isn't an efficient way to popcount on MSP430, even if you hand-optimize the asm. Shift counts of more than 1 apparently do take 1 cycle per count, even if you use a shift-by-4 instruction, according to that PDF I found. For 8 bits, I think a loop that stops when `a` becomes zero is probably a good bet. My answer shows you can get it down to 3 insns in the loop. (It's not my downvote, BTW.) – Peter Cordes Jan 16 '19 at 19:25
@PeterCordes - it was just a suggestion. At 15,000+, I don't care about downvotes anymore. – rcgldr Jan 16 '19 at 19:30

Need to count number of 1's in binary with assembly

3 Answers3