Debugging iterative Fibonacci (converted by hand from C to RISC-V)

Question

I'm trying to learn assembly so I tried to convert this c code to RISC-v assembly code:

int Fib_Iter(int x){

int Fib_f, Fib_s, temp, Result;
Fib_f = 0;
Fib_s = 1;
if (x == 0)
    Result = Fib_f;
else if (x == 1)
    Result = Fib_s;
else
{
    while (x >= 2)
    {
        temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    Result = Fib_s;
}
return Result;
}

and this is my RISC-V assembly code:

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12
    #res in x11
    #x28 =1
    #x29=2
    addi sp,sp,-16
    sw x5,0(sp)
    sw x6,4(sp)
    sw x7,8(sp)
    sw x11,12(sp)              #saving x5,x6,x7,x11 in the stack
    addi x28,x0,1             #adding 1 to x28 to use it in the 2ndif statment
    addi x29,x0,2              #adding 2 to x28 to use it in the 3rd if statment
    bne x12,x0,lb1             #first if statment to check if x==0
    add x11,x6,x0               #if x !=0 make result(x11) equal to (fb_f)x6
    lb1:
    bne x12,x28,lb2           #2nd if statement to check if x==1
    add x11,x7,x0            #if x !=1 make result(x11) equal to (fb_s)x6
    lb2:                      #if it equals to 1 start the loop
    loop:
    add x5,x6,x7             #just an add's 
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop       #check if x >2 if yes go to the loop else 
    addi x11,x7,0          #result(x11)=x7
    lw x5,0(sp)            #load all the registers from the stack
    lw x6,4(sp)
    lw x7,8(sp)
    lw x11,12(sp)
    addi sp,sp,16         #return the stack pointer to its original place
    jalr x0,0(x1)

but I'm not getting the right value in Register 11 when using the venus simulator.

When I call it using the value 4, I got 4 but the right answer is 3.

What value do you get? What value do you expect to get? Please consider annotating your code with some comments. Right now, it's hard to read due to the poor formatting and hard to understand due to the absence of comments. — fuz, Dec 19 '20 at 17:51
when i call it using the value 4 i got 4 but the right answer is 3, alright i will edit it asap — Jhon abu hameda, Dec 19 '20 at 18:01

Erik Eidt · Answer 1 · 2020-12-19T18:07:08.143

You have pseudo code in C, which is great.

Comments:

There are some missing statements in the translation to assembly.
The if-then-else statements aren't right: only one of then/else should be executed, and you have to tell the processor that by avoiding the else part from the then part.
The register usage is off from the standard calling convention. We expect the first integer parameter in a0, so that we would expect your x to be in a0. The return value should also be in a0. There's no need to save t0,t1,t2,a1 — they are call clobbered just like t3,t4 (which you aren't saving and don't have to).

If you want to have your own calling convention, that's fine, but you're returning a value in a1 and also restoring a1 from the stack, and those don't make sense together.

See comments inline:

    int Fib_Iter(int x) {   <------------- x is passed in a0
        int Fib_f, Fib_s, temp, Result;
        Fib_f = 0;          <------------- where is this line in the assembly??
        Fib_s = 1;          <------------- where is this line in the assembly??
        if (x == 0)
            Result = Fib_f;
        else if (x == 1)
            Result = Fib_s;
        else
        {
            while (x >= 2)
            {
                temp = Fib_f + Fib_s;
                Fib_f = Fib_s;
                Fib_s = temp;
                x--;
            }
            Result = Fib_s;
        }
        return Result;   <--------- return value goes in a0
    }

See comments inline:

   fib_iter:
    #temp in x5,
    #f_f in x6
    #f_s in x7
    #x in x12       <---- x should be found in a0
    #res in x11     <---- return value should be put in a0 (at the end of the function)
    #x28 =1
    #x29=2
    addi sp,sp,-16  <---- no stack space needed
    sw x5,0(sp)     <---- no need to save t0
    sw x6,4(sp)     <---- no need to save t1
    sw x7,8(sp)     <---- no need to save t2
    sw x11,12(sp)   <---- no need to save a1
    addi x28,x0,1
    addi x29,x0,2
    bne x12,x0,lb1
    add x11,x6,x0   <---- then part, good
                    <---- missing code to skip else part
                       after executing a then part the code
                       (should skip the else part)
                       should resume the logically next thing after the if
    lb1:
    bne x12,x28,lb2
    add x11,x7,x0
                    <---- missing code to skip else part
    lb2:
    loop:
    add x5,x6,x7
    addi x6,x7,0
    addi x7,x5,0
    addi x12,x12,-1
    bge x12,x29,loop
    addi x11,x7,0
    lw x5,0(sp)      <---- no need to reload t0
    lw x6,4(sp)      <---- no need to reload t1
    lw x7,8(sp)      <---- no need to reload t2
    lw x11,12(sp)    <---- no need to reload a1 (this also clobbers current a1 return value)
    addi sp,sp,16    <---- no stack space is needed
    jalr x0,0(x1)

Thanks, A lot bro, a have a question what is the missing code to skip else part I really don't know how to skip it I thought the program will skip it on its own and that is stupid — Jhon abu hameda, Dec 19 '20 at 18:12
you need to change the flow of control there, unconditionally. that is done with a branch, but you also have to know where to branch to, and that is to the logically next thing after the entire if-statement, which is basically the } that is the end of the function — Erik Eidt, Dec 19 '20 at 18:14
Oh, unconditional branches in RISC V are usually referred to as jumps. — Erik Eidt, Dec 20 '20 at 01:12

Peter Cordes · Accepted Answer · 2020-12-20T03:20:18.603

Just for the record, your C can be simplified. You don't need to separately check for x == 0 and x == 1; you can do if (x < 2) return x; to return either 0 or 1.

(I assume you don't intend to handle negative inputs, so unsigned might have been a good choice. Your C returns 1 for negative x, reaching the loop but running 0 iterations, leaving Fib_s unmodified. But I assume that's not important behaviour.)

A minimal implementation in asm can be much simpler than your version. This is a leaf function (no calls to other functions) so we can use call-clobbered ("temporary") registers for everything. I used the "ABI names" for registers to help keep track of which are traditionally used for arg-passing and call-clobbered vs. call-preserved.

Actually I got good asm from clang, for this C:

int Fib_Iter(int x){
    if (x < 2) 
        return x;

    int Fib_f = 0, Fib_s = 1;
    while (x >= 2) {
        int temp = Fib_f + Fib_s;
        Fib_f = Fib_s;
        Fib_s = temp;
        x--;
    }
    return Fib_s;
}

Godbolt compiler explorer, RISC-V clang -O3

# clang -O3 output:
# arg:     int x      in  a0
# returns: int Fib(x) in a0

    Fib_Iter:
        addi    a1, zero, 2
        blt     a0, a1, .LBB0_3         # if(x<2) goto ret with x still in a0 as the retval
                # otherwise fall through to the rest and init vars
        mv      a3, zero                # Fib_f = 0
        addi    a2, a0, 1               #  tmpcount = x+1  compiler invented this
        addi    a0, zero, 1             # Fib_s = 1
          # x>=2 is known to be true on first iteration so a standard do{}while() loop structure works
.LBB0_2:                                # do{
        add     a4, zero, a0                # tmp = Fib_s
        addi    a2, a2, -1                  # tmpcount--
        add     a0, a0, a3                  # Fib_s += Fib_f
        add     a3, zero, a4                # Fib_f = tmp
        blt     a1, a2, .LBB0_2         # }while(2<tmpcount);
.LBB0_3:
        ret

Same logic should work for unsigned, avoiding the weirdness of returning a negative x. clang compiles it somewhat differently with unsigned types, but I don't think that's necessary.

The tmpcount = x+1 can probably be avoided using ble (reversed operands for bge) instead of blt so we can use 2 and x directly, saving another instruction.

Fibonacci unrolls very nicely: a += b; b += a; takes one instruction per step, not 3. Checking the branch condition at each step could actually be best for static code size, as well as much better for dynamic instruction count. (Related: an x86 asm answer that stores an array of Fibonacci values, including an unrolled version that only checks the branch condition once per loop, using clever startup to handle odd vs. even before entering the loop).

(Of course if you're optimizing for non-tiny n, evaluating Fibonacci(n) can be done in log2(n) shift and multiply / add steps, instead of n additions, using fancier math.)

This is with an unroll that just repeats the loop condition. The loop exit logic is non-trivial to verify for correctness, though, so it's shorter but not simpler.

# arg:   unsigned n  in  a0
# returns:    Fib(n) in  a0

fib_unrolled:
        addi    t2, zero, 2
        bltu    a0, t2, .Lsmall_n         # if(n<2) return n
                # otherwise fall through
        mv      t0, zero               # a=0
        addi    t1, zero, 1            # b=1
             # known: x>=2  (unsigned) before first iteration
.Lloop:                                # do{
        beq     a0, t2, .first_out          # if(n==2) return a+b;
        addi    a0, a0, -2                  # n-=2
        add     t0, t0, t1                  # a += b
        add     t1, t1, t0                  # b += a
        bgeu    a0, t2, .Lloop         # }while(n >= 2);

        mv      a0, t1
.Lsmall_n:
        ret

.Lfirst_out:
        add      a0, t0, t1      # add instead of mv so the beq can be earlier
        ret

I was able to get clang to almost exactly reproduce this from C source (with different register numbers, but the exactly same loop order.) Including putting the add a+b block after the regular fall-through ret. It schedules the instructions in the loop body better than I did, separating the two fib sequence additions if we're assuming a dual-issue in-order pipeline. However, clang still insists on wasting an instruction loading a 1 constant by turning n >= 2 into n > 1; RISC-V can do bgeu as a reversed-operands bltu (https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) and it already has 2 in a register.

unsigned fib_unroll(unsigned n){
    if (n < 2) 
        return n;

    unsigned a = 0, b = 1;
    do {
        if (n == 2) return a+b;
        a += b;
        n -= 2;          // clang schedules this way, better for an in-order pipeline
        b += a;
    }while (n >= 2);
    return b;
}

A more clever unroll: one branch inside the loop. Init the two vars based on count being odd or even

If we want to always do an even number of additions, we can arrange to start with either 0, 1 or 1, 0, depending on n & 1. Starting with 1,0 means we first do 1+=0, basically skipping an iteration. (I originally came up with this trick for this x86 Fibonacci answer)

unsigned fib_unroll2_simpler(unsigned n){
    if (n < 2) 
        return n;
    unsigned b = n&1;  // current
    unsigned a = b^1;  // prev
    // start with 0,1 or 1,0 to get either 1 or 2 after two additions
    do {
        n -= 2;
        a += b;
        b += a;
    }while (n >= 2);
    return b;
}

This has a long data dependency from n to the result, especially for smallish n, doing some basically wasted work. Not great on wide out-of-order exec machines for small n. So it's interesting but for a real use-case you might still want a table lookup for small-n starting points. clang does a very reasonable job, but wastes some instructions around the start:

fib_unroll2_simpler:
        addi    a2, zero, 2
        add     a1, zero, a0
        bgeu    a0, a2, .LBB0_2
        add     a0, zero, a1            # copy `n` back into a0 where it already was?!?
        ret
.LBB0_2:                                # the non-tiny n common case has a taken branch
        andi    a0, a1, 1
        xori    a2, a0, 1
        addi    a3, zero, 1          # constant 1 to compare against
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        addi    a1, a1, -2
        add     a2, a2, a0
        add     a0, a0, a2
        bltu    a3, a1, .LBB0_3      # }while(1<n); fails to reuse the 2 it already had in a2 earlier
        ret

Depending on the cost of branching, it might be better to branch into the middle of the loop to start things off. This also means we can always start with 1,1 when we enter the loop, instead of spending an iteration adding zeros. But that makes n==2 a special case: we need to return 1, and can't do any additions of 1+1. But 1 is one of our special-case return values, so we can tweak that path to return n != 0 and let the rest of the function assume n >= 3 or higher.

With some further optimization to minimize instruction count for RISC-V (e.g. avoiding the need to construct a constant 2 in a register to shorten the non-tiny-n common case), I came up with this. (A _v1 version is in the Godbolt link)

unsigned fib_unroll2_branchy_v2(unsigned n){
    if (n <= 2) 
        return n!=0;  // 0 or 1
    n -= 3;     // check for n<=2 and copy n on a machine without FLAGS
    unsigned i = n&1;

    unsigned b = 1;
    //if (n==2) return b;  // already eliminated.
    unsigned a = 1;

    if (i == 0) goto odd_entry;   // n-=3 flips the low bit, so this detects odd n
    do{
        a += b;
odd_entry:
        i += 2;
        b += a;
    }while (i <= n);  // safe even for n near uint_max because we subtracted 3 first
    return b;
}

clang doesn't do an optimal job here, wasting some copy instructions in the loop that we conditionally jump into. (Compilers often have a hard time when you do that in C, but it's sometimes a useful trick for hand-written asm). So instead, here's a hand-written version that doesn't suck as much:

fib_unroll2_branchy_v2:
        addi    t2, a0, -3            # n -= 3  (leaving a copy of the orig)
        bleu    t2, a0, .Lsmall_n     # if( (n-3) > n) detect wrapping, i.e. n<=2

        andi    t0, t2, 1             # i = n&1
        addi    a0, zero, 1           # b = retval, replacing orig_n
        addi    a1, zero, 1           # a
        beqz    t0, .Lodd_entry       # even i means orig_n was odd

.Lloop:                              # do{
        add     a1, a1, a0            # a += b
.Lodd_entry:
        addi    t0, t0, 2             # i += 2
        add     a0, a0, a1            # b += a
        bleu    t0, t2, .Lloop       # }while(i <= n);
        ret

.Lsmall_n
        snez    a0, a0                # return orig_n != 0 handles n<3
        ret

There may be a few optimizations I missed. In fib_unroll2_simpler (the branchless one), it would be nice to find some ILP (instead of basically one long dependency chain apart from eventually n-=2), or get a jump-start on reaching the loop termination by doing fewer iterations instead of turning the first half of the loop into a no-op. This version just needs the final result, doesn't need to store every Fib value along the way into an array like my x86 answer did.

Even the branchy_v2 version feels like a longer dep chain than we'd really like to init i, but it'll be fine on a not-super-wide pipeline.

Debugging iterative Fibonacci (converted by hand from C to RISC-V)

2 Answers2

A more clever unroll: one branch inside the loop. Init the two vars based on count being odd or even