Forcing tail call elimination in C with inline asm

Question

I have a small programming language I've been designing, and I want a relatively easy backend option for it. C stands out as a relatively decent option for all the usual reasons (such as "I can't follow good advice" and "I don't know assembly"), plus more importantly the fact that there's not very much in my language that can't easily be expressed in C (it's more about the type checker, no GC or heavy-duty runtime to accommodate) - except one thing: optimised tail calls.

It's not a deal breaker: I don't really need it, as the language has looping constructs, but it's supposed to be functional too, so TCO seems like a valuable thing to put in if possible. Trampoline or lazy-thunk solutions would work, but appear (from the literature, I have not profiled yet, and anyway don't have any usage information) to have a noticeable effect on performance. Don't want the implementation to discourage users from expressing a problem in a valid way.

This question leaves me thinking it ought to be possible to create a relatively simple platform-specific "tailcall" block, to replace "return" and force the behaviour that I want. But, not being familiar enough with x86, I don't know how to control C's automatic stack manipulation, where I ought to be aiming the jump, etc. (It looks to me like it extends the stack after the start of a function... so I need to either shrink the stack before the tailcall, which would be a bit unnecessary, or somehow aim a jump at the start of the target's body after the stack manipulation, which sounds pretty much impossible edit: for a statically-unknown target ..?)

So:

can this be done?
can it be done safely?
is it at all sensible?

I should repeat that I am aware that it is possible to implement tail call elimination that works in C (that is, of my high-level language, not of C), I just want to add a compiler-and-platform-specific optimisation, to help it be be more or less indistinguishable from the "real thing". And I mainly want to use C because I'm very confident with it, whereas I've never used LLVM and have only the barest reading-knowledge of x86, so I was hoping to skip that step and get output working. I know I will have to learn both properly in the future at some point, but ...

(i.e. I'm asking this because the idea is stuck in my head and it bugs me that there might be a Better Way To Do It(tm), not because I need the solution to get a working backend.)

This reminds me of [Continuation Passing Style](http://en.wikipedia.org/wiki/Continuation-passing_style) and, specifically, [Chicken](http://en.wikipedia.org/wiki/Chicken_%28Scheme_implementation%29), a Scheme to C compiler. You may want to study those. — Alexey Frunze, Sep 30 '12 at 15:19
I don't see why you would need inline assembly for this. You could emit C code that had a label at the beginning of the function, then you would replace the tail call with a rebinding of the function parameters and a goto to the label. — Ferruccio, Sep 30 '12 at 15:22
TCE is just changing a recursive tail call into a `while(1) { body }` statement, while keeping track of the arguments before 'going again'. Edit: Basically what @Ferruccio said ;p — leppie, Sep 30 '12 at 15:26
I would already do that for recursion, sure, but I mean the general case of a tail-call to a statically-unknown function pointer. Unless I misunderstand you totally, the `goto`/`while` solution wouldn't help with that. — Alex Celeste, Sep 30 '12 at 15:33
Better check the code GCC generates with some optimization, it is quite good at eliminating tail recursion (and even some non-tail recursions). — vonbrand, Jan 31 '13 at 13:38

score 1 · Accepted Answer · edited May 23 '17 at 11:43

1

can this be done?

Yes.

can it be done safely?

Yes.

is it at all sensible?

No. gcc already does TCO for you and additionally supports computed goto, so there's absolutely no reason to do this.

edited May 23 '17 at 11:43

Community

1
1

answered Sep 30 '12 at 15:44

geocar

9,085
1
29
37

The TCO question you linked has little to do with this. It discusses finding out whether gcc has performed TCO on a specific function, by dumping the assembler output. – jonvuri Sep 30 '12 at 16:29
@Kiyura - it illustrates that GCC performs TCO; you do not need to resort to assembly to get it on GCC. – geocar Sep 30 '12 at 17:31
But it's an optimization that, by nature, is not guaranteed. Even if it was, it doesn't apply in this case, since the function being called is not necessarily known at compile time. – jonvuri Sep 30 '12 at 18:00
@Kiyura - Where *exactly* in the post did he say "the function is not necessarily known at compile time"? – geocar Sep 30 '12 at 18:02
That just says function pointer: There's no reason TCO doesn't work with function pointers. – geocar Sep 30 '12 at 18:06
I would like to be able to guarantee the optimisation if I say it will be provided, and I would like to be able to guarantee that it will work with *any* function, not necessarily knowable at compile-time, and not necessarily a sibling-call. Even so GCC might provide a suitable solution by itself with appropriate fiddling. The computed gotos suggestion is also interesting, I forgot about those entirely. They don't work from out of scope, though. – Alex Celeste Sep 30 '12 at 19:08
Leushenko: It might be more productive if you post an example of code that you think should be TCO that isn't. – geocar Sep 30 '12 at 19:27
This, for instance: `int inner(int x, int a, int b, int c) { return x + a + b + c; } int(* extfunc)(int, int, int, int) = inner; int wrapper(int a, int b, int c) { return extfunc(42, a, b, c); }` ...doesn't turn into a jump even with -O3, because the compiler can't predict the body of extfunc and it doesn't have a sibling-signature. – Alex Celeste Sep 30 '12 at 19:39
It most certainly does: `jmp *%rax` – geocar Sep 30 '12 at 19:47
Not on my machine, but if it does for you then it does *somewhere*, and... ehh, I guess that's good enough for now, given that I still have the slow-fallback. OK, accepted. – Alex Celeste Sep 30 '12 at 20:14
@Leushenko: What gcc rev are you using ? I'm seeing the `jmpq *%...` for gcc 3.4 onwards (uses `%r11` till 4.2, since then `%rax`, as do all version of clang/llvm since at least 2.9), but gcc 3.2 and below are unable to optimize the `call` away. What compiler rev are you tied to ? – FrankH. Oct 01 '12 at 12:08
4.6.3, 32-bit only at the moment (is the 32-bit part important? Wasn't expecting it to be). – Alex Celeste Oct 01 '12 at 13:19
@Leushenko: 32-bit _is_ important. The fact that argument passing in 32bit happens _on the stack_ means that `wrapper()` in your example _must have a stackframe_ to do the arg reordering, and therefore _cannot be tail-eliminated_ ... – FrankH. Oct 04 '12 at 12:59
2

@Leushenko: The key question for whether TCO can be applied to a given function is really whether that function needs a stackframe. That means, as soon as arguments are passed on the stack different from the original args given to the wrapper itself, the `call` _won't_ be elminated. In 32bit x86, if the number of args to the wrapped func is different from that to the wrapper, you always have that situation (as args are passed on stack). In 64bit x86_64, this occurs if the wrapped function has more args than can fit in argument registers. _Not every_ tail-returning function is eligible for TCO. – FrankH. Oct 04 '12 at 13:09

Forcing tail call elimination in C with inline asm

1 Answers1