1

I have found the following code for counting bits in a 4-bytes integer in Arm assembly in the minimal number of instructions:

  ;R0 - value
  ;R1 = number of ones
  ;Uses R0-R1
  ;81 cycles worst case, 4 cycles best case, exit when r1=0

mov    r1,r0,lsr #31
loop   movs     r0,r0,lsl #2    
       adc  r1,r1,r0,lsr #31    
       bne  loop
       mov  pc,r14

Can you please explain what is the algorithm here? I can't understand it though I know what all instructions do.

Arcady27
  • 13
  • 1
  • 4
  • 1
    Possible duplicate of [Fastest way to count number of 1s in a register, ARM assembly](http://stackoverflow.com/questions/15736602/fastest-way-to-count-number-of-1s-in-a-register-arm-assembly); I understand this is **NOT** an exact duplicate, but how can someone look at this question and not want to see the other (besides the OP)? – artless noise Mar 14 '16 at 13:40
  • @artlessnoise: I incorporated that link into my answer; thanks for digging it up. I don't think this question should be marked as a duplicate; your comment link is enough. If this technique is the best choice for any ARM core, it should be added as an answer to the other question. Maybe one without a CLZ instruction and where a LUT is undesirable? The shift direction of the loop should be chosen depending on whether the common case is to have set bits only near the top or only near the bottom. – Peter Cordes Mar 14 '16 at 13:55

1 Answers1

2
       mov    r1,r0,lsr #31        @ start with r1 = the high bit of r0 (right shift by 31 bits)
loop   movs     r0,r0,lsl #2       @ left shift r0 by 2, and set flags on the result
       adc  r1,r1,r0,lsr #31
       bne  loop                   @ loop if r0 is non-zero (testing flags set by movs)

The add-with-carry is a neat trick: r0 >> 31 is the high bit, and the carry flag is the bit that was shifted out by the movs r0,r0,lsl #2 (I assume ARM works this way like x86, otherwise the algorithm doesn't make sense.)

So it adds 2 bits per iteration to the popcnt total: the high bit, and the last bit shifted out.


This is not the fastest way to popcnt.

A previous question: Fastest way to count number of 1s in a register, ARM assembly addressed this, with one answer using vcnt.8 d0, d0 and then a horizontal sum of the four 8bit counts. None of the answers there mentions this two-bits-at-a-time loop.


Even without a LUT (e.g. 8-bit LUT to look up the count for each byte) or a dedicated popcnt instruction, there are bithacks, like this one from Sean Anderson's collection:

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

constant-time no branches, but does require several 32bit constants (takes two instructions each to get into regs on ARM)

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thank you, @Peter! To clarify: by shift by 2 bits (`movs r0,r0,lsl #2`) why only one bit is shifted out and stored as carry flag (not two bits are shifted out) ? – Arcady27 Mar 14 '16 at 00:23
  • @Arcady27: 2 bits are shifted out each iteration: one that we counted last iteration (previously the high bit), and one that ends up in the carry flag for this iteration. So `adc` and ARM's shift-on-the-fly operands are a nice trick to double the speed of the usual naive popcnt loop. (Half as many iterations before the input is zero.) – Peter Cordes Mar 14 '16 at 05:01