8

My programming language compiles to C, I want to implement tail recursion optimization. The question here is how to pass control to another function without "returning" from the current function.

It is quite easy if the control is passed to the same function:

void f() {
    __begin:
    do something here...
    goto __begin; // "call" itself
}

As you can see there is no return value and no parameters, those are passed in a separate stack adressed by a global variable.

Another option is to use inline assembly:

#ifdef __clang__
    #define tail_call(func_name) asm("jmp " func_name " + 8");
#else
    #define tail_call(func_name) asm("jmp " func_name " + 4");
#endif

void f() {
    __begin:
    do something here...
    tail_call(f); // "call" itself
}

This is similar to goto but as goto passes control to the first statement in a function, skipping the "entry code" generated by a compiler, jmp is different, it's argument is a function pointer, and you need to add 4 or 8 bytes to skip the entry code.

The both above will work but only if the callee and the caller use the same amount of stack for local variables which is allocated by the entry code of the callee.

I was thinking to do leave manually with inline assembly, then replace the return address on the stack, then do a legal function call like f(). But my attempts all crashed. You need to modify BP and SP somehow.

So again, how to implement this for x64? (Again, assuming functions have no arguments and return void). Portable way without inline assembly is better, but assembly is accepted. Maybe longjump can be used?

Maybe you can even push the callee address on the stack, replacing the original return address and just ret?

exebook
  • 32,014
  • 33
  • 141
  • 226
  • Can't you let the C compiler do the optimisation? – Paul Floyd Feb 06 '18 at 15:41
  • @PaulFloyd this will not work with TinyC which is a must for me. Also you need -O2 which is not always on for me. – exebook Feb 06 '18 at 15:45
  • @exebook that restriction is _not_ stated in your question. – davmac Feb 06 '18 at 15:59
  • "The both above will work but only if the callee and the caller use the same amount of stack for local variables which is allocated by the entry code of the callee." For recursive calls this should be the case. Why is that different in your approach? – UniversE Feb 06 '18 at 16:02
  • @UniversE, because sometimes function a() calls b() which in turn calls a() again, we got a recursion, but it is not direct and we need to take care about stack allocated variable space. – exebook Feb 06 '18 at 16:06
  • @exebook in those cases, `b()` is certainly inlineable (otherwise, you don't get an optimizable tail recursion). So try to inline `b()` in `a()` first (including the stack layout), and then perform your usual tricks. *edit*: of course your compiler must check this property / must not blindly inline the code. – UniversE Feb 06 '18 at 16:37
  • Do you imply that you optimization only needs to work for callees with no arguments? Because otherwise using asm(jmp) is not viable. – Uprooted Feb 06 '18 at 18:13
  • @user7231 yes, I maintain another stack for arguments, globally defined as `var * var_stack[MAX_SIZE]` – exebook Feb 06 '18 at 18:43
  • 1
    @exebook, missed that in text, sorry. Your idea of doing leave seems impractical to me, because on some compilers leave (and/or enter) can be omitted on higher optimization levels, or functions can be inlined, so there's no actual frame, etc. I've linked to some vm guy question in my answer, though I can't vouch for the accepted solution there, it seems to me you're heading the same way. You already have separate arg stack. – Uprooted Feb 06 '18 at 20:28

4 Answers4

5

Do not try to do this yourself. A good C compiler can perform tail-call elimination in many cases and will do so. In contrast, a hack using inline assembly has a good chance of going wrong in a way that is difficult to debug.

For example, see this snippet on godbolt.org. To duplicate it here:

The C code I used was:

int foo(int n, int o)
{
    if (n == 0) return o;
    puts("***\n");
    return foo(n - 1, o + 1);
}

This compiles to:

.LC0:
  .string "***\n"
foo:
  test edi, edi
  je .L4
  push r12
  mov r12d, edi
  push rbp
  mov ebp, esi
  push rbx
  mov ebx, edi
.L3:
  mov edi, OFFSET FLAT:.LC0
  call puts
  sub ebx, 1
  jne .L3
  lea eax, [r12+rbp]
  pop rbx
  pop rbp
  pop r12
  ret
.L4:
  mov eax, esi
  ret

Notice that the tail call has been eliminated. The only call is to puts.

davmac
  • 20,150
  • 1
  • 40
  • 68
1

Since you don't need arguments and return values, how about combining all function into one and use labels instead of function names?

f:
    __begin:
    ...
    CALL(h); // a macro implementing traditional call
    ...
    if (condition_ret)
       RETURN; // a macro implementing traditional return
    ...
    goto g; // tail recurse to g

The tricky part here is RETURN and CALL macros. To return you should keep yet another stack, a stack of setjump buffers, so when you return you call longjump(ret_stack.pop()), and when you call you do ret_stack.push(setjump(f)). This is poetical rendition ofc, you'll need to fill out the details.

gcc can offer some optimization here with computed goto, they are more lightweight than longjump. Also people who write vms have similar problems, and seemingly have asm-based solutions for those even on MSVC, see example here.

And finally such approach even if it saves memory, may be confusing to compiler, so can cause performance anomalies. You probably better off generating for some portable assembler-like language, llvm maybe? Not sure, should be something that has computed goto.

Uprooted
  • 941
  • 8
  • 21
1

The venerable approach to this problem is to use trampolines. Essentially, every compiled function returns a function pointer (and maybe an arg count). The top level is a tight loop that, starting with your main, simply calls the returned function pointer ad infinitum. You could use a function that longjmps to escape the loop, i.e., to terminate the progam.

See this SO Q&A. Or Google "recursion tco trampoline."

For another approach, see Cheney on the MTA, where the stack just grows until it's full, which triggers a GC. This works once the program is converted to continuation passing style (CPS) since in that style, functions never return; so, after the GC, the stack is all garbage, and can be reused.

Doug Currie
  • 40,708
  • 1
  • 95
  • 119
0

I will suggest a hack. The x86 call instruction, which is used by the compiler to translate your function calls, pushes the return address on the stack and then performs a jump.

What you can do is a bit of a stack manipulation, using some inline assembly and possibly some macros to save yourself a bit of headache. You basically have to overwrite the return address on the stack, which you can do immediately in the function called. You can have a wrapper function which overwrites the return address and calls your function - the control flow will then return to the wrapper which then moves to wherever you pointed it to.

Paul92
  • 8,827
  • 1
  • 23
  • 37