Just for the record, your C can be simplified. You don't need to separately check for x == 0
and x == 1
; you can do if (x < 2) return x;
to return either 0 or 1.
(I assume you don't intend to handle negative inputs, so unsigned
might have been a good choice. Your C returns 1
for negative x, reaching the loop but running 0 iterations, leaving Fib_s unmodified. But I assume that's not important behaviour.)
A minimal implementation in asm can be much simpler than your version. This is a leaf function (no calls to other functions) so we can use call-clobbered ("temporary") registers for everything. I used the "ABI names" for registers to help keep track of which are traditionally used for arg-passing and call-clobbered vs. call-preserved.
Actually I got good asm from clang, for this C:
int Fib_Iter(int x){
if (x < 2)
return x;
int Fib_f = 0, Fib_s = 1;
while (x >= 2) {
int temp = Fib_f + Fib_s;
Fib_f = Fib_s;
Fib_s = temp;
x--;
}
return Fib_s;
}
Godbolt compiler explorer, RISC-V clang -O3
# clang -O3 output:
# arg: int x in a0
# returns: int Fib(x) in a0
Fib_Iter:
addi a1, zero, 2
blt a0, a1, .LBB0_3 # if(x<2) goto ret with x still in a0 as the retval
# otherwise fall through to the rest and init vars
mv a3, zero # Fib_f = 0
addi a2, a0, 1 # tmpcount = x+1 compiler invented this
addi a0, zero, 1 # Fib_s = 1
# x>=2 is known to be true on first iteration so a standard do{}while() loop structure works
.LBB0_2: # do{
add a4, zero, a0 # tmp = Fib_s
addi a2, a2, -1 # tmpcount--
add a0, a0, a3 # Fib_s += Fib_f
add a3, zero, a4 # Fib_f = tmp
blt a1, a2, .LBB0_2 # }while(2<tmpcount);
.LBB0_3:
ret
Same logic should work for unsigned, avoiding the weirdness of returning a negative x
. clang compiles it somewhat differently with unsigned
types, but I don't think that's necessary.
The tmpcount = x+1
can probably be avoided using ble
(reversed operands for bge
) instead of blt
so we can use 2
and x
directly, saving another instruction.
Fibonacci unrolls very nicely: a += b;
b += a;
takes one instruction per step, not 3. Checking the branch condition at each step could actually be best for static code size, as well as much better for dynamic instruction count. (Related: an x86 asm answer that stores an array of Fibonacci values, including an unrolled version that only checks the branch condition once per loop, using clever startup to handle odd vs. even before entering the loop).
(Of course if you're optimizing for non-tiny n
, evaluating Fibonacci(n) can be done in log2(n) shift and multiply / add steps, instead of n additions, using fancier math.)
This is with an unroll that just repeats the loop condition. The loop exit logic is non-trivial to verify for correctness, though, so it's shorter but not simpler.
# arg: unsigned n in a0
# returns: Fib(n) in a0
fib_unrolled:
addi t2, zero, 2
bltu a0, t2, .Lsmall_n # if(n<2) return n
# otherwise fall through
mv t0, zero # a=0
addi t1, zero, 1 # b=1
# known: x>=2 (unsigned) before first iteration
.Lloop: # do{
beq a0, t2, .first_out # if(n==2) return a+b;
addi a0, a0, -2 # n-=2
add t0, t0, t1 # a += b
add t1, t1, t0 # b += a
bgeu a0, t2, .Lloop # }while(n >= 2);
mv a0, t1
.Lsmall_n:
ret
.Lfirst_out:
add a0, t0, t1 # add instead of mv so the beq can be earlier
ret
I was able to get clang to almost exactly reproduce this from C source (with different register numbers, but the exactly same loop order.) Including putting the add a+b
block after the regular fall-through ret. It schedules the instructions in the loop body better than I did, separating the two fib sequence additions if we're assuming a dual-issue in-order pipeline. However, clang still insists on wasting an instruction loading a 1
constant by turning n >= 2
into n > 1
; RISC-V can do bgeu
as a reversed-operands bltu
(https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) and it already has 2
in a register.
unsigned fib_unroll(unsigned n){
if (n < 2)
return n;
unsigned a = 0, b = 1;
do {
if (n == 2) return a+b;
a += b;
n -= 2; // clang schedules this way, better for an in-order pipeline
b += a;
}while (n >= 2);
return b;
}
A more clever unroll: one branch inside the loop. Init the two vars based on count being odd or even
If we want to always do an even number of additions, we can arrange to start with either 0, 1
or 1, 0
, depending on n & 1
. Starting with 1,0 means we first do 1+=0
, basically skipping an iteration. (I originally came up with this trick for this x86 Fibonacci answer)
unsigned fib_unroll2_simpler(unsigned n){
if (n < 2)
return n;
unsigned b = n&1; // current
unsigned a = b^1; // prev
// start with 0,1 or 1,0 to get either 1 or 2 after two additions
do {
n -= 2;
a += b;
b += a;
}while (n >= 2);
return b;
}
This has a long data dependency from n to the result, especially for smallish n, doing some basically wasted work. Not great on wide out-of-order exec machines for small n. So it's interesting but for a real use-case you might still want a table lookup for small-n starting points.
clang does a very reasonable job, but wastes some instructions around the start:
fib_unroll2_simpler:
addi a2, zero, 2
add a1, zero, a0
bgeu a0, a2, .LBB0_2
add a0, zero, a1 # copy `n` back into a0 where it already was?!?
ret
.LBB0_2: # the non-tiny n common case has a taken branch
andi a0, a1, 1
xori a2, a0, 1
addi a3, zero, 1 # constant 1 to compare against
.LBB0_3: # =>This Inner Loop Header: Depth=1
addi a1, a1, -2
add a2, a2, a0
add a0, a0, a2
bltu a3, a1, .LBB0_3 # }while(1<n); fails to reuse the 2 it already had in a2 earlier
ret
Depending on the cost of branching, it might be better to branch into the middle of the loop to start things off. This also means we can always start with 1,1
when we enter the loop, instead of spending an iteration adding zeros. But that makes n==2
a special case: we need to return 1, and can't do any additions of 1+1. But 1 is one of our special-case return values, so we can tweak that path to return n != 0
and let the rest of the function assume n >= 3 or higher.
With some further optimization to minimize instruction count for RISC-V (e.g. avoiding the need to construct a constant 2
in a register to shorten the non-tiny-n common case), I came up with this. (A _v1 version is in the Godbolt link)
unsigned fib_unroll2_branchy_v2(unsigned n){
if (n <= 2)
return n!=0; // 0 or 1
n -= 3; // check for n<=2 and copy n on a machine without FLAGS
unsigned i = n&1;
unsigned b = 1;
//if (n==2) return b; // already eliminated.
unsigned a = 1;
if (i == 0) goto odd_entry; // n-=3 flips the low bit, so this detects odd n
do{
a += b;
odd_entry:
i += 2;
b += a;
}while (i <= n); // safe even for n near uint_max because we subtracted 3 first
return b;
}
clang doesn't do an optimal job here, wasting some copy instructions in the loop that we conditionally jump into. (Compilers often have a hard time when you do that in C, but it's sometimes a useful trick for hand-written asm). So instead, here's a hand-written version that doesn't suck as much:
fib_unroll2_branchy_v2:
addi t2, a0, -3 # n -= 3 (leaving a copy of the orig)
bleu t2, a0, .Lsmall_n # if( (n-3) > n) detect wrapping, i.e. n<=2
andi t0, t2, 1 # i = n&1
addi a0, zero, 1 # b = retval, replacing orig_n
addi a1, zero, 1 # a
beqz t0, .Lodd_entry # even i means orig_n was odd
.Lloop: # do{
add a1, a1, a0 # a += b
.Lodd_entry:
addi t0, t0, 2 # i += 2
add a0, a0, a1 # b += a
bleu t0, t2, .Lloop # }while(i <= n);
ret
.Lsmall_n
snez a0, a0 # return orig_n != 0 handles n<3
ret
There may be a few optimizations I missed. In fib_unroll2_simpler
(the branchless one), it would be nice to find some ILP (instead of basically one long dependency chain apart from eventually n-=2
), or get a jump-start on reaching the loop termination by doing fewer iterations instead of turning the first half of the loop into a no-op. This version just needs the final result, doesn't need to store every Fib value along the way into an array like my x86 answer did.
Even the branchy_v2 version feels like a longer dep chain than we'd really like to init i
, but it'll be fine on a not-super-wide pipeline.