Your loop branch should be cmp / bne loop, so you fall through to end the loop. That means one fewer branch instruction inside the loop. See Why are loops always compiled into "do...while" style (tail jump)?.
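A rough sketch of that shape (the body and register choices here are placeholders, not your code):

        @ r3 = dst, r2 = endp, set up before the loop
loop:                           @ do {
        str   r0, [r3], #4      @ loop body: store and post-increment dst
        cmp   r3, r2            @ compare against the end pointer
        bne   loop              @ } while (dst != endp) -- falls through when done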
Also, take advantage of setting flags with instructions you already need, instead of using separate TST or CMP instructions.
If you're going to use a counter separate from the output pointer, count it down towards zero so you can subs r4, r4, #1 / bne.
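A minimal sketch of that counted loop (r4 for the counter and r5 for the pointer are arbitrary picks, not registers from your code):

        mov   r4, #10           @ counter = number of elements
loop:
        str   r0, [r5], #4      @ store and post-increment the output pointer
        lsl   r0, r0, #1        @ (whatever the loop body does with the value)
        subs  r4, r4, #1        @ decrement and set flags: no separate cmp needed
        bne   loop              @ loop while counter != 0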
There are a lot of missed optimizations in your code, especially your insane way of creating constants in registers. ARM has a real mov instruction; use it instead of ANDing or ADDing with zero.
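For example (values picked arbitrarily):

        mov   r1, #10           @ small constant straight into a register
        mvn   r2, #0            @ r2 = ~0 = 0xFFFFFFFF, the bitwise NOT of an immediate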
Have a look at what a good C compiler will do: compiler output is often a good starting point for optimization, or a way to learn tricks for the machine you're targeting. (See also How to remove "noise" from GCC/clang assembly output? and Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”.)
Your version stores the first element without checking its high bit, so if the input had only the high bit set, you'd store another 9 zeros. IDK if that's what you wanted, or if that's a case you don't have to handle. (i.e. maybe your inputs are guaranteed to be non-negative signed numbers).
// uint32_t would be more portable, but ARM has 32-bit unsigned int
void store_sequence(unsigned val)
{
    unsigned *dst = (unsigned *)700;
    unsigned *endp = dst + 10;
    // val = 1;  // use the function arg so it's not a compile-time constant
    for (; dst < endp; dst++) {
        *dst = val;            // first store without checking the value
        val <<= 1;
        if (val >= (1UL << 31))
            break;
    }
}
Checking val right after shifting it gives nice asm: otherwise the compiler doesn't always take advantage of using the shift to set flags. Note that even though it's a for() loop, the compiler can prove that the condition is true the first time, and doesn't add an extra check / branch at the top to see if the loop should run zero times.
I put this code on the Godbolt compiler explorer to get gcc and clang output for ARM.
gcc7.2.1 -O3 fully unrolls the loop. With a larger count, it eventually decides to make a loop, but the unrolled loop is interesting: no pointer increment is required if you fully unroll. Using a different shift-count to re-shift the original also creates instruction-level parallelism (a CPU could run multiple shift instructions in parallel because there's no dependency on the result of the previous one).
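To show what that independence means (a sketch, not from the compiler output below):

        @ independent: all three shift the original r0, so they can overlap
        lsls  r1, r0, #1
        lsls  r2, r0, #2
        lsls  r3, r0, #3
        @ vs. a serial dependency chain: each shift must wait for the previous result
        lsls  r1, r0, #1
        lsls  r2, r1, #1
        lsls  r3, r2, #1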
Notice that lsls sets flags from the shift, and ARM's flags include an N flag that's set if the high bit of the result is set. The MInus condition is true if N==1. The name comes from 2's complement negative numbers, but everything is just bits and you can use it to branch on the high bit. (The PLus condition is strangely named: it's true for non-negative results including zero, i.e. it only checks that N==0. https://community.arm.com/processors/b/blog/posts/condition-codes-1-condition-flags-and-codes)
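In isolation, that idiom is just (the label name is made up):

        lsls  r0, r0, #1        @ val <<= 1, setting N/Z/C from the result
        bmi   done              @ taken when N==1, i.e. bit 31 of the result is set
        @ ... more work here when the high bit wasn't set ...
done:
        bx    lr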
Instead of an actual bmi (branch if minus), the compiler decided to use a predicated bx lr. i.e. return if MInus, otherwise it runs as a NOP. (Using -mcpu=cortex-a57 results in a bmi to the bottom of the loop, with a bx lr there. Apparently the tuning options for that microarchitecture make gcc avoid predicated bx instructions.)
@ On function entry, val is in r0. Use mov r0, #1 if you want
@ from gcc7.2.1 -O3
store_sequence:
        mov   r3, #0            @ this is the most efficient way to zero a reg
        lsls  r2, r0, #1        @ instruction scheduling: create r2 early
        str   r0, [r3, #700]    @ gcc just uses offsets from a zeroed reg
        bxmi  lr                @ if(val<<1 has its high bit set) return;
        lsls  r1, r0, #2
        str   r2, [r3, #704]    @ 2nd store, using val<<1 after checking it
        bxmi  lr
        lsls  r2, r0, #3        @ alternating r1 and r2 for software pipelining
        str   r1, [r3, #708]    @ 3rd store, using val<<2 after checking it
        bxmi  lr
        ...
To get a rolled-up loop, you can increase the loop count, or compile with -Os (optimize for code-size).
With endp = dst+100 and gcc -O3 -mcpu=cortex-a57 (to avoid bxmi lr), we get an interesting loop that's entered by jumping into the middle, so it can fall through at the bottom. (In this case it probably would have been more efficient just to let the cmp / beq run on the first iteration, or to put cmp/bne at the bottom. -Os does the latter.)
@ gcc -O3 -mcpu=cortex-a57 with loop count = 100 so it doesn't unroll.
store_sequence:
        mov   r3, #700
        movw  r2, #1100         @ Cortex-A57 has movw. add would work, too.
        b     .L3
.L6:                            @ do {
        cmp   r3, r2
        beq   .L1               @ if(p==endp) break;
.L3:                            @ first iteration loop entry point
        str   r0, [r3]
        lsls  r0, r0, #1        @ val <<= 1
        add   r3, r3, #4        @ add without clobbering flags
        bpl   .L6               @ } while(val's high bit is clear)
.L1:
        bx    lr
With -Os, we get a better-looking loop. The only downside is that bmi (or bxmi lr) reads flags right away in the next instruction after lsls sets them. You could schedule the add between them, though. (Or in Thumb mode you'd want to keep it this way, because adds has a shorter encoding than add, and adds would clobber the flags lsls just set if you moved it in between.)
@ gcc7.2.1 -Os -mcpu=cortex-a57
store_sequence:
        mov   r3, #700          @ dst = 700
.L3:                            @ do{
        str   r0, [r3]
        lsls  r0, r0, #1        @ set flags from  val <<= 1
        bxmi  lr                @ bmi to the end of the loop would work
        add   r3, r3, #4        @ dst++
        cmp   r3, #740
        bne   .L3               @ } while(p != endp)
        @ FIM:
        bx    lr
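The rescheduling mentioned above would look something like this (my rearrangement, not compiler output); it works because plain add doesn't write the flags, so bxmi still sees the flags from lsls:

@ hand-scheduled variant of the -Os loop, not gcc output
store_sequence:
        mov   r3, #700          @ dst = 700
.L3:                            @ do {
        str   r0, [r3]
        lsls  r0, r0, #1        @ val <<= 1, setting flags
        add   r3, r3, #4        @ dst++: add (not adds) leaves the flags alone
        bxmi  lr                @ still reads the N flag that lsls set
        cmp   r3, #740
        bne   .L3               @ } while(p != endp)
        bx    lr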
With a larger endp that won't fit in an immediate operand for cmp, gcc calculates it in a reg outside the loop. It always uses mov, or loads it from a literal pool in memory, instead of using add r2, r3, #8192 or something. I'm not sure I constructed a case where an immediate for add would work but an immediate for movw wouldn't.
Anyway, regular mov works for small immediates, but movw is a newer encoding that's not baseline, so gcc only uses movw when you compile with -mcpu= something that has it.
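To illustrate the difference (values chosen just to show the encodings, including the 1100 from the loop above): a classic ARM mov immediate has to be an 8-bit value rotated right by an even amount, while movw (ARMv6T2 / ARMv7 and later) takes any 16-bit value.

        mov   r2, #740          @ ok: 0x2E4 fits the rotated-8-bit immediate pattern
        mov   r2, #0xFF00       @ ok: 0xFF rotated
        @ mov r2, #1100         @ not encodable: 0x44C doesn't fit that pattern
        movw  r2, #1100         @ movw takes any 16-bit immediate, hence gcc's choice above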