
I was just reading about how Google's Go creates each thread with a reduced stack size by default, and then links in new stacks if an overflow would otherwise occur (see page 16 here). I was wondering about the best way to do that in C.

I have to say I am no C expert, so there might be a better way to detect a stack overflow in C, but given my ignorance, here's how I thought I would implement it:

The first thing I thought is that every time we have a fresh new stack, we take the address of a stack variable, which gives us roughly the starting stack address. Then we would need to be able to retrieve how much stack space the thread has. This should be possible if the thread isn't the main thread, but I have no idea how to get this information in C.

Then we would need to check (per function call, perhaps) how much of the stack is already used, by taking the address of a current stack variable. If we detect a possible stack overflow, we would need some way to create a new stack and link it to the previous one. The only way I could think to do that in C is to create a new thread to execute the function we want, and block the current thread until the function returns its result.

So, would there be a cleaner/better way to implement this?

Deduplicator
Waneck
    I'm not sure I like your anti-overflow attitude. Are you sure you're not on the wrong site ;) – Chris Eberle Apr 14 '11 at 03:10
  • Note that there's nothing overflow-free about split stack. `malloc` (or whatever method is used to allocate new stack fragments) can still fail, and there's no way for the application to detect and handle this. – R.. GitHub STOP HELPING ICE Apr 15 '11 at 11:25

2 Answers


See GCC's split-stack capability. I believe this was originally implemented to support Go. It pretty much works as you suggest.

EDIT: The commentary below discusses another system that does heap-allocation of activation records.

Ira Baxter
  • That's a great link! Thank you! I am reading it right now. The only remaining question, for me, is how to determine the current stack size - preferably in ANSI C – Waneck Apr 14 '11 at 03:24
  • 1
    I don't think you "determine" the stack size. What you do at each [point of stack allocation is determine how much stack you want to allocate (I think the split-stack scheme uses 1 VM page) and store that size somewhere known (e.g, at the bottom of the stack if it is a power of two size). The split-stack scheme, if it always allocates a *fixed* size chunk, simply *knows* how big each segment is. – Ira Baxter Apr 14 '11 at 03:26
  • 1
    The reason I know about this link is that I have built a non-C language (PARLANSE) that implements heap-allocated stack fragments, so I've long been interested in such schemes. PARLANSE allocates a new fragment for *every* function call, at about the price, I think, as the GCC method. The overhead is about 3-5% based on our measurements, but our language and coding style tends to encourage big functions (a good C compiler will do that, too, by inlining all the the little ones). See www.semanticdesigns.com/Products/PARLANSE – Ira Baxter Apr 14 '11 at 03:29
  • Wow! That's great info!!! I am also looking into this when thinking about possible optimizations for compiling a language to ANSI C. Are you using the "create a new thread, and set a lock on the current stack" approach? – Waneck Apr 14 '11 at 03:31
  • 1
    No, PARLANSE is far crazier than that. It is designed to allow *millions* of computational grains all running in parallel, which you can't do with the "big stack model" (see http://stackoverflow.com/questions/1016218/how-does-a-stackless-language-work/1053159#1053159). You want to do this if each grain can fork parallel sub-work, and the forked grains can interlock over resources. So PARLANSE multiplexes 1 OS thread per CPU onto potentially millions of running "grains". In practice, it tries to keep the grain count to a number much bigger than threads, and that's enough to be effective. – Ira Baxter Apr 14 '11 at 03:35
  • Just read again and saw that you're implementing heap-allocated stack fragments! Is the overhead you're talking about the speed difference, or memory overhead? – Waneck Apr 14 '11 at 03:36
  • 1
    Where do you get enough work to justify this? We use PARLANSE to process complete systems of source code (see www.semanticdesigns.com/Products/DMS/DMSToolkit.html), which requires huge data structures and lots of analysis. Better parallel for this than not. PARLANSE isn't perfect for this job, but its better than anything else we've seen :-{ – Ira Baxter Apr 14 '11 at 03:37
  • 1
    The overhead is the speed difference. The memory overhead is roughly 25% because we are using buddy-system power of two sized blocks. Each function knows how much stack space it (and its parallel sub-grains) need, and asks for the next power of two size up. The real trick is in having bunches of available blocks that don't need locks to allocate :-} We use a lot of thread local storage to make this efficient. – Ira Baxter Apr 14 '11 at 03:40
  • 1
    What this really means for us in practice is that we can write recursive algorithms over huge structures (mostly ASTs or spanning treess on graphs) and not run out of stack space, or worry about running out of stack space. We've had recursions go several million deep! While you might do this with one thread, imagine it happening to one of many random grains and you don't know which one in advance. You pretty much have to allocate stacks in small fragments. – Ira Baxter Apr 14 '11 at 03:42
  • wow, it all sounds fantastic! Thank you for the links, and +1 for all the info! I don't think I will implement this with a heap-allocated stack, as it would be very hard to do in ANSI C, but I will definitely keep this in mind. I never thought the overhead was so small! – Waneck Apr 14 '11 at 03:45
  • Maybe you could always pass the stack pointer as an argument to the next function call; that way you could link fragments as split-stack does, and get better memory use? – Waneck Apr 14 '11 at 03:50
  • We do pass the stack pointer "in a register" :-} to the next function call, which saves it so it can restore that SP on return. Yes, we could have implemented a scheme like split-stacks, but then every grain would need 4Kb for stack space, times 1 million grains --> 4Gb of VM gone in just stacks. With our scheme, we can theoretically get by with 1 million grains with only a cache line of stack space per grain --> 32Mb, or 3% of the virtual space. (In practice, Windows' stupid tricks make our stack frames bigger, but they're still a lot smaller than 4Kb on average.) – Ira Baxter Apr 14 '11 at 03:55
  • PARLANSE was also designed 15 years ago, when you could hardly get 128Kb of RAM, and 4Gb of VM would page thrash you to death. It would be much better at your "fine grain server" on a small machine. – Ira Baxter Apr 14 '11 at 03:57

You can do this - I believe modern gcc may even have an option for it - but it greatly increases the cost of function calls and has very little practical benefit. Especially on modern systems with 64-bit addressing, there's plenty of address space for each thread to have its own stack plenty far from every other thread's stack. If you find yourself using more than logarithmic-scale call recursion, something is wrong with your algorithms anyway...

R.. GitHub STOP HELPING ICE
  • What do you mean by "greatly"? As far as I can see, it would need at most a push of a new temporary variable and a compare. And when you think of e.g. a webserver, having a small stack space for each thread is very important! – Waneck Apr 14 '11 at 03:23
  • 1
    It doesn't greatly increase the cost of function calls. It only takes a few insructions more than that standard call built into the CPU, and most interesting functions are hundreds of instructions long. (Shorter functions tend to disappear if the compiler is good at inlining). At ;east that's my personal experience; se comments on my answer. – Ira Baxter Apr 14 '11 at 03:31
  • 1
    For a webserver you can already have a small stack for each thread. Just use `pthread_attr_setstacksize(&attr, getconf(_SC_THREAD_STACK_MIN));` and don't write idiotic recursive code. – R.. GitHub STOP HELPING ICE Apr 14 '11 at 16:14
  • As for function size, well-factored code uses small, simple functions, and with shared libraries and opaque types with accessor functions, there is no way the compiler can optimize away the calls. Even without additional ugly waste like split-stack, I'm already encountering situations where gcc's call overhead (stuff like stack alignment and useless register saving in cases that don't need it) is incurring 10-30% of the total runtime of a function. This would easily jump to 60-90% with split-stack. – R.. GitHub STOP HELPING ICE Apr 14 '11 at 16:15
  • Also note that, with good code, a compiler can easily perform static analysis to determine the exact stack size needed for every thread start function. Only code that's using recursion or arbitrary callback functions defies establishing compile-time bounds on the stack space needed. – R.. GitHub STOP HELPING ICE Apr 14 '11 at 16:19
  • @R. That doesn't work with separate compilation or dynamic loaders. – Ira Baxter Apr 15 '11 at 04:25
  • It can - the [dynamic] linker would just have to put together some information generated by the compiler. – R.. GitHub STOP HELPING ICE Apr 15 '11 at 11:23
  • @R: I don't see how to do this without a full garbage collector. If you compute a new stack size, then you need to move the stack, and relocate all the pointers that might be in old stack. – Ira Baxter Apr 15 '11 at 13:25
  • I never said anything about computing a new stack size at runtime. This is all static analysis. In regards to dynamic linking I was talking about normal dynamic linking, not `dlsym`, which is of course the same as callbacks which I already acknowledged preclude this kind of analysis. – R.. GitHub STOP HELPING ICE Apr 15 '11 at 13:37