Recursive fibonacci Assembly

Question

Today I wrote a recursive fibonacci in assembly and it doesn't work. I compiled it to object file with NASM and than made it elf with gcc.
When I enter 1 or 2 the function works perfectly, but when I enter 3, 4, 5, 6, or more the function doesn't work. I think there is problem where the function calls itself.

This the code:

SECTION .data ;init data




str: db "This equal: %d",10,0

SECTION .text   ;asm code


extern printf
global main

main:
push ebp
mov ebp,esp
;--------------------


push 03  ; the index 
call _fibonacci
add esp,4

push DWORD eax
push str
call printf


;---------------------

mov esp,ebp
pop ebp
ret

This is the function:

_fibonacci:

push ebp
mov ebp,esp


mov ebx, [ebp+8] ;; param n 
cmp ebx,0
jne CHECK2

mov eax,0
jmp _endFIbofunc        

CHECK2: 
    cmp ebx,0x1
    jne ELSE3
    mov eax,1
jmp _endFIbofunc

ELSE3:

mov ebx,[ebp+8] 
dec ebx  ;; n-1


;;  FIRST call
push ebx
call _fibonacci
add esp,4
mov edx,eax

;;  SEC CALL
dec ebx
push ebx
call _fibonacci
add esp,4 
add eax,edx


mov eax,[ebp-4]

_endFIbofunc:

mov esp,ebp
pop ebp
ret

After I ran it on Ubuntu 16.04 it send error:

Segmentation fault (core dumped)

What might be the problem?

score 1 · Answer 1 · answered Oct 29 '17 at 18:34

1

mov eax,[ebp-4]

You are using the memory at [ebp-4] without having put something useful in it! You need to reserve this space in your function prolog:

_fibonacci:
    push ebp
    mov  ebp, esp
    sub  esp, 4

Returning from the 1st recursive call you put the result from EAX in this memory dword.
Returning from the 2nd recursive call you add to EAX the contents of this memory dword.
Doing so, the EDX register will no longer be clobbered.

Why do you use the EBX register at all? If you use it you have to preserve it as was explained in the answer by Al Kepp.
If you start by putting the argument in EAX, you know that for both values below 2 (so 0 and 1), the result is just equal to the argument. Simple.

    mov  eax, [ebp+8] ;; param n 
    cmp  eax, 2
    jb   _endFIbofunc

If you don't balance the stack directly after the 1st recursive call, you can just decrement the dword that is already there and make your second recursive call.

    dec  eax              ; n-1
    push eax              ;(*)
    call _fibonacci
    mov  [ebp-4], eax
    dec  dword ptr [esp]  ; n-2
    call _fibonacci
    add  esp,4            ;(*)
    add  eax, [ebp-4]

The whole proc:

_fibonacci:
    push ebp
    mov  ebp, esp
    sub  esp, 4           ;(*)
    mov  eax, [ebp+8] ;; param n 
    cmp  eax, 2
    jb   _endFIbofunc        
    dec  eax              ; n-1
    push eax              ;(*)
    call _fibonacci
    mov  [ebp-4], eax
    dec  dword ptr [esp]  ;(*) n-2
    call _fibonacci
    add  esp,4            ;(*)
    add  eax, [ebp-4]
_endFIbofunc:
    mov  esp, ebp
    pop  ebp
    ret

answered Oct 29 '17 at 18:34

Sep Roland

33,889
7
43
76

Thank you for improving my proc but: in line that you wrote : dec dword ptr [esp] ;(*) n-2 . nasm yell : error: comma, colon, decorator or end of line expected after operand – Or Shemesh Oct 29 '17 at 19:21
@OrShemesh: NASM syntax just uses `dword`. The `dword ptr` syntax is for MASM. – Peter Cordes Oct 30 '17 at 05:13
Note that in most calling conventions, functions "own" their args on the stack. You've chosen to have `_fibonacci` leave the arg unmodified and have the caller reuse it after the function returns. That works, but I think you could save the `sub esp,4` if you use your own arg as a spill slot. Also, it's a bit confusing to use an `[esp]` addressing mode when you have `ebp`. I guess you're doing that instead of an EBP-relative mode to show that it's the arg to the next function call. – Peter Cordes Oct 30 '17 at 05:23
what would happen if an interrupt (or context switch) that used the user stack occurred between `add esp,4` and `add eax, [ebp-4]` ? Why have `add esp,4` at all since this is handled later on by `mov esp,ebp`? Unrelated - the function could optionally be changed to not use ebp at all. – rcgldr Oct 31 '17 at 00:47
@rcgldr: `add esp,4` is cleaning the `push eax` off the stack. It's `[ebp-4]` is still above `esp` (it *is* `[esp]`). You're right that it's pointless if using a stack frame. (Agreed that using `ebp` at all is more trouble than it's worth, except for beginners. Saving/restoring `ebx` and using a call-preserved register might be better than spilling to a stack slot. That's what a compiler would do. (if it didn't turn recursion into a loop)) – Peter Cordes Oct 31 '17 at 04:34
@PeterCordes - missed an update to my comment. I determined that the `add esp,4` didn't present a potential issue, but forgot to delete that part of my comment when noting only that it wasn't needed since it gets restored with `mov esp,ebp`. Thanks for catching that. – rcgldr Oct 31 '17 at 05:52

score 0 · Answer 2 · answered Oct 29 '17 at 15:27

0

You must store (push) the registers you are going to change in the recursive call. And then restore their original values (pop). That should fix the problem.

Something like this:

Push all registers you are going to use in your function (except eax where you will get return value)
Push ebx as that is your parameter
Call the function
Add esp,4
Pop all registers you pushed in the first step, now in the reverse order

answered Oct 29 '17 at 15:27

Al Kepp

5,831
2
28
48

*"That should fix the problem."* I'm afraid that's a bit too optimistical. The OP is using uninitialized memory to get the function result. – Sep Roland Oct 29 '17 at 18:40

rcgldr · Answer 3 · 2017-10-31T03:15:57.587

0

In addition to the other answers provided, here's an alternate solution:

_fibonacci:
        mov     eax,[esp+4]             ;eax = n
        cmp     eax,2                   ;br if n < 2
        jb      _endFIbofunc
        dec     eax                     ;push n-1
        push    eax
        call    _fibonacci              ;returns eax = fib(n-1)
        xchg    eax,[esp]               ;eax = n-1, [esp] = fib(n-1)
        dec     eax                     ;push n-2
        push    eax
        call    _fibonacci              ;returns eax = fib(n-2)
        add     eax,[esp+4]             ;eax = fib(n-1)+fib(n-2)
        add     esp,8
_endFIbofunc:
        ret

Trivia - fib(47) is the largest < 2^32. The number of recursive calls = 2*fib(n+1)-1.

 n     fib(n)      # calls

 0          0            1
 1          1            1
 2          1            3
 3          2            5
 4          3            9
 5          5           15
 6          8           25
 7         13           41
 8         21           67
 9         34          109
10         55          177
11         89          287
12        144          465
13        233          753
14        377         1219
15        610         1973
16        987         3193
17       1597         5167
18       2584         8361
19       4181        13529
20       6765        21891
21      10946        35421
22      17711        57313
23      28657        92735
24      46368       150049
25      75025       242785
26     121393       392835
27     196418       635621
28     317811      1028457
29     514229      1664079
30     832040      2692537
31    1346269      4356617
32    2178309      7049155
33    3524578     11405773
34    5702887     18454929
35    9227465     29860703
36   14930352     48315633
37   24157817     78176337
38   39088169    126491971
39   63245986    204668309
40  102334155    331160281
41  165580141    535828591
42  267914296    866988873
43  433494437   1402817465
44  701408733   2269806339
45 1134903170   3672623805
46 1836311903   5942430145
47 2971215073   9615053951
48 4807526976   ...

edited Oct 31 '17 at 03:15

answered Oct 31 '17 at 01:30

rcgldr

27,407
3
36
61

1

`xchg eax,[esp]` has an implied `lock` prefix. It's good for code size, but extremely bad for performance. Load into ecx, then store eax back to that location. – Peter Cordes Oct 31 '17 at 04:26
The OP calls `printf` in NASM on `[tag:linux]`, and their `main` doesn't use ecx / edx, so you should assume that the i386 SysV ABI is fine. (Although the recursive calls are "private" so it's ok not to maintain 16B stack alignment.) – Peter Cordes Oct 31 '17 at 05:57
xor-swap with memory? Hmm, yeah I think that would be faster, especially with 2 loads and one memory-dest it's only one store-forwarding round trip so it can benefit from the store buffer instead of having to flush it and access cache directly. Doing two memory-dest `xor`s and one memory source would have the register result ready after the 2nd instruction, but be worse for throughput (and still makes it wait for a store-forwarding round trip.) – Peter Cordes Oct 31 '17 at 06:07
Recursive Fibonacci is just dumb. Simple and efficient for beginners is simple iterative `add eax, edx` / `add edx, eax`. (For future readers, see rcgldr's and my answers on https://stackoverflow.com/questions/32659715/assembly-language-x86-how-to-create-a-loop-to-calculate-fibonacci-sequence for efficient loops, and for a Lucas sequence that gets Fib(n) in less than O(log2(n)) time, but with a higher constant). A better recursive function would be Ackermann's, which needs some kind of stack because it grows so fast. (But still fits in 32 bits for small inputs) – Peter Cordes Oct 31 '17 at 06:28
@PeterCordes - ran out of space in last comment due to links. So for 32 bit unsigned integer fib(47) is the max, taking 9+ billion calls via recursion or 6 loops with matrix or lucas sequence, and for 64 bit unsigned integers, fib(93) is the max, taking 3.948 x 10^19 calls via recursion or 7 loops with matrix or lucas sequence. – rcgldr Oct 31 '17 at 16:09
Note that the constant factor is significant. `fib(47)` only takes 47 clock cycles for the simple non-recursive `add eax, edx` / `add edx,eax` loop. Lucas sequences need to run faster than ~7.8 cycles per iteration to win for 47/6. If `n` is a power of 2 then probably yes, but a couple branch mispredicts would hurt it if using your implementation https://godbolt.org/g/TCu3uh with gcc/clang. (And the more `1` bits in `n`, the more work it has to do.) For small `n`, especially with 32-bit results, it's not a big win even when it is a win. Some CPUs have slower 64-bit imul, hurting Lucas. – Peter Cordes Oct 31 '17 at 16:29
You're highly likely to get a branch mispredict from a computed jump. So it's no better than a normal loop with a mispredict on the loop-exit branch. (Even Silvermont should run an `add, add, / sub ecx,1 / jnz` loop at 1 iteration per 2 clocks (i.e. 1 add per clock)). In fact on most CPUs your idea is *worse*. With a normal loop, the mispredict comes at the end, when a big `add` dep chain is already in the pipe, so out-of-order execution can keep chewing on it while recovering from the branch miss leaving the loop. – Peter Cordes Oct 31 '17 at 16:55
(Mainstream CPUs can issue the 3 or 4 uops loop (depending on sub/jcc macro-fusion: SnB-family only) at 1 iter per clock, so the front-end can get ahead of execution. (The loop overhead can run in parallel with the Fib dep chain, and will get to the end of the loop twice as fast, keeping up with the front-end.) Also, a loop doesn't even need `n` to be ready to make significant progress in calculating Fib(n). The loop-branch will predict strongly-taken unless `n` is often very small, so the Fib calculation can be running while OOO exec is working through the dep chain for the `n` input. – Peter Cordes Oct 31 '17 at 16:59
Of course that last bit only works if you write the function not to have a data dependency on `n` for the starting values of the sequence. i.e. start with `0` and `1` constants (instead of `n&1` and `1`) and sort out the last odd iteration at the end (branchless or not). I hadn't thought about this before when writing loops; I wonder if it helps much. I guess maybe if the count is often calculated in the caller (or loaded with a cache-miss), and branch-prediction gets you to the loop code before the counter is ready but the other inputs are. Oh right, less likely for loops with ptr inputs. – Peter Cordes Oct 31 '17 at 17:06
Yeah, that's exactly what I was picturing. It's not faster than a loop, because the loop overhead can run in parallel with the 1 `add` per clock dep chain. And it has the downside of putting the typical branch mispredict *before* any of the `add` dep chain is in the pipe being executed. Also it has much larger code size. The only way this can be good if called with exactly the same `n` every time, or a simple repeating pattern on CPUs that can predict that (not low-power Silvermont). And even then it's not better than a loop in any significant way, if you write the loop efficiently. – Peter Cordes Oct 31 '17 at 17:38
Yes, potentially useful on CPUs that aren't superscalar, like i486. I think even in-order P5 Pentium could run the loop fine, as long as you reorder it so add/sub and lea/jnz can pair. (Use `lea` instead of one of the `add`s so you don't clobber flags and can pair the loop-counter `sub` with the first `add`). Actually I'm not sure if a taken-branch caused any kind of bubble on P5. 2-clock loops are definitely fine on P6, and most (all?) AMD CPUs. – Peter Cordes Oct 31 '17 at 17:52
Yes, if it predicts correctly. But I'm picturing a use-case where you call this function with different `n` most times, leading to a different branch target every time. Mainstream x86 CPUs (not low power Silvermont) can predict patterns for indirect branches. IDK if it's as powerful as a the taken/not-taken prediction based on global or local history for conditional branches. But even that wouldn't predict very well if you're not calling with the same `n` most times. (And why would you? Save `Fib(n)` instead of recomputing, so you only get patterns when the program input has patterns.) – Peter Cordes Oct 31 '17 at 18:03
1

Prediction is needed in two places: fetch-block prediction (which just has to predict the next block to fetch, given that we just fetched this one). But also the exact target address. RIP is updated speculatively based on the prediction, and obviously that has to point to exactly the right instruction; the only mechanism for correcting the mistake when the branch actually executes to verify the prediction is a roll-back to the known-good state before the branch, not by trying to adjust the architectural state to account for a missed or wrongly-executed instruction. – Peter Cordes Oct 31 '17 at 23:01
@PeterCordes - I posted a question about this in Computer Scient [here](https://cs.stackexchange.com/questions/83271/indexed-branch-overhead-on-x86-64-bit-mode) . I've deleted my prior comments. Your comment about xchg should remain here, since it is relevant. – rcgldr Nov 01 '17 at 02:59
For xchg with memory, I added a section about that to https://stackoverflow.com/questions/26469196/swapping-2-registers-in-8086-assembly-language16-bits/47021804#47021804 (the main answer is about exchanging registers) – Peter Cordes Nov 01 '17 at 03:26

Recursive fibonacci Assembly

3 Answers3

Linked