popcount in arm assembly without neon

Question

I have read this as well as wiki I understand the following code should result in 12 instructions in asm.

 i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
 i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
 i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
 return (i * 0x01010101) >> 24;          // horizontal sum of bytes

I am currently solving this problem, and my best correct solution so far executes 190 instructions total in 10 tests (19 per test).

// A test case to test your function with
.global _start
_start:
    mov r0, #5
    bl popcount
    1: b 1b    // Done

// Only your function (starting at popcount) is judged. The test code above is not executed.
    
popcount:

    PUSH    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
    
    ldr r2, =0x55555555 // 2bits
    ldr r3, =0x33333333 // 4bits
    ldr r4, =0x0F0F0F0F // 8bits
    ldr r6, =0x01010101 // 8bits
    
    
    //x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
    lsr r1, r0, #0x1 // one shift
    and r1, r1, r2
    sub r0, r0, r1

    //x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
    lsr r1, r0, #0x2 // two shift
    and r1, r1, r3
    and r5, r0, r3  
    add r0, r5, r1
    
    //x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
    lsr r1, r0, #0x4 // two shift
    add r1, r1, r0
    and r0, r1, r4

    //return (i * 0x01010101) >> 24;
    mul r0, r0, r6
    lsr r0, r0, #0x18

    POP    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}
    
    bx lr

The best running score so far is 120 instructions. But here the problems are:

I need to push/pop not to clobber registers (2 instructions)
I need the LDR instructions because constants are too big for immediate (4 instructions), also no rotation would work.
I need to return from function (1 instruction)
Neon SIMD instructions are not available (I tried)

These are precisely the 7 extra instructions per test.

How would you proceed to get only 12 instructions executed?

What variant of the ARM architecture are you programming for? Can you assemble his code in thumb mode? Can you use Thumb2 instructions? — fuz, Aug 21 '22 at 14:02
Thumb2 can use repeated patterns like 0x55555555 as an immediate for `and`. Instructions without an immediate can apply a shift to one of the operands as part of one instruction. ("flexible second operand" / barrel shifter. See the examples on https://www.davespace.co.uk/arm/introduction-to-arm/barrel-shifter.html) — Peter Cordes, Aug 21 '22 at 14:04
@PeterCordes If it is for [this exercise](https://cpulator.01xz.net/doc/#sim_processors), it seems Thumb mode is not supported. — fuz, Aug 21 '22 at 14:05
My recommendation to you: learn about the flexible third operand to join multiple instructions into one. Instead of just pushing and popping all sorts of registers, read up on the calling convention and only save/restore those registers you really need to. Most likely, you do not have to save any of those you use. — fuz, Aug 21 '22 at 14:06
Compiler output is often a good starting point. e.g. from the SO answer you linked, https://godbolt.org/z/sa58WaWYT clang and GCC with various tuning / architecture options make different choices. Or with `-Oz` / `-Os` to optimize for code-size instead of speed: https://godbolt.org/z/dhYv7qM8n - clang then avoids movw/movt even when it's available, instead still using PC-relative loads from memory. — Peter Cordes, Aug 21 '22 at 14:11
@fuz Thanks, I'll try. It is not clear what I can use. System is defined as armv7 generic in the simulator and neon is not supported. So likely only the least common denominator of armv7 plateforms. — user2346536, Aug 21 '22 at 14:21
Godbolt code runs the test in 140 instructions. Your manual assembler in 190. Here we have evidence that compilers are better than humans. — 0___________, Aug 21 '22 at 15:33
@0___________ Compilers are better than beginners at assembly programming, certainly. — fuz, Aug 21 '22 at 16:16
That site is utterly useless: It can't handle all `ARMv7` instructions, plus it only counts number of executed instructions and NOT the really important cycles. It's just an outdated `ARMv5` simulator disguised as an `ARMv7` one. What a scam! — Jake 'Alquimista' LEE, Aug 21 '22 at 20:35

Jake 'Alquimista' LEE · Answer 1 · 2022-08-21T19:18:22.013

0

movw    r1, #0x5555
movw    r2, #0x3333
movt    r1, #0x5555
movt    r2, #0x3333

and     r3, r0, r1
bic     r12, r0, r1

add     r0, r3, r12, lsr #1
movw    r1, #0x0f0f

and     r3, r0, r2
bic     r12, r0, r2
add     r0, r3, r12, lsr #2
movt    r1, #0x0f0f

add     r0, r0, r0, lsr #4
mov     r2, #0
and     r0, r0, r1
usad8   r0, r0, r2

bx      lr

4+4 won't spill 4 bits, hence you don't need and operations before adding.
usad8 (unsigned sum of absolute difference 8bits) with zero does the trick adding 4bytes.

The routine above takes 12 cycles on Cortex-A8 and 15 cycles on the weaker Cortex-A7. Scheduling is the key so that as many instructions as possible get dual-issued.

edited Aug 21 '22 at 19:18

answered Aug 21 '22 at 19:12

Jake 'Alquimista' LEE

6,197
2
17
25

It does not count cycles only instructions. And it does not pass tests on the OP's webpage. – 0___________ Aug 21 '22 at 19:20
1

@0___________ That webpage has problems with `usad8` opcode. It seems that webpage is stuck with `ARMv5` era. Claiming being a tutorial for `ARMv7` is outright a scam since `usad8` has been available since `ARMv6` – Jake 'Alquimista' LEE Aug 21 '22 at 20:14

score -1 · Answer 2 · answered Jan 31 '23 at 15:54

-1

Can be rewritten for 15 instructions like this

popcount:

PUSH    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}     

//x -= (x >> 1) & m1;             //put count of each 2 bits into those 2 bits
lsr r1, r0, #0x1 // one shift
and r1, r1, #0x55555555
sub r0, r0, r1

//x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits 
lsr r1, r0, #0x2 // two shift
and r1, r1, #0x33333333
and r5, r0, #0x33333333
add r0, r5, r1

//x = (x + (x >> 4)) & m4;        //put count of each 8 bits into those 8 bits 
lsr r1, r0, #0x4 // two shift
add r1, r1, r0
and r0, r1, #0x0F0F0F0F

//return (i * 0x01010101) >> 24;
mul r0, r0, #0x01010101
lsr r0, r0, #0x18

POP    {r1,r2, r3, r4, r5, r6, r7, r8, r9, r10}

bx lr

answered Jan 31 '23 at 15:54

Gary

99
2
10

1

What CPU or mode can this assemble for? `arm-none-eabi-gcc -c arm-pop.S` won't assemble it for ARM mode; `Error: ARM register expected -- 'mul r0,r0,#0x01010101'` , also `Error: invalid constant (55555555) after fixup`. Maybe you were thinking Thumb mode? `.thumb` / `.cpu cortex-a53` / `.syntax unified` makes everything assemble except the `mul`-immediate (which I don't think exists at all in ARM or Thumb). But apparently this assignment / exercise doesn't allow Thumb mode, only ARM mode, which is why Jake used movw/movt instead of immediates. – Peter Cordes Jan 31 '23 at 16:42
Peter, good point on MUL not taking immediate values . "ldr r6, =0x01010101" is still required. So we are down to 16 instructions when we use immediate values for the AND instructions. – Gary Feb 01 '23 at 10:18

popcount in arm assembly without neon

2 Answers2