In practice, if you can't set a hard upper bound on possible size of less than 1kiB or so, you should normally just dynamically allocate. If you can be sure the size is that small, you could consider using alloca
as the container for your stack.
(You can't conditionally use a VLA, it has to be in-scope. Although you could have its size be zero by declaring it after an if()
, and set a pointer variable to the VLA address, or to malloc
. But alloca would be easier.)
In C++ you'd normally std::vector
, but it's dumb because it can't / doesn't use realloc
(Does std::vector *have* to move objects when growing capacity? Or, can allocators "reallocate"?). So in C++ it's a tradeoff between more efficient growth vs. reinventing the wheel, although it's still amortized O(1) time. You can mitigate most of it with a fairly large reserve()
up front, because memory you alloc but never touch usually doesn't cost anything.
In C you have to write your own stack anyway, and realloc
is available. (And all C types are trivially copyable, so there's nothing stopping you from using realloc). So when you do need to grow, you can realloc the storage. But if you can't set a reasonable and definitely-large-enough upper bound on function entry and might need to grow, then you should still track capacity vs. in-use size separately, like std::vector. Don't call realloc
on every push/pop.
Using the callstack directly as a stack data structure is easy in pure assembly language (for ISAs and ABIs that use a callstack i.e. "normal" CPUs like x86, ARM, MIPS, etc). And yes, in asm worth doing for stack data structures that you know will be very small, and not worth the overhead of malloc
/ free
.
Use asm push
or pop
instructions (or equivalent sequence for ISAs without a single-instruction push / pop). You can even check the size / see if the stack data structure is empty by comparing against a saved stack-pointer value. (Or just maintain an integer counter along side your push/pops).
A very simple example is the inefficient way some people write int->string functions. For non-power-of-2 bases like 10, you generate digits in least-significant first order by dividing by 10 to remove them one at a time, with digit = remainder. You could just store into a buffer and decrement a pointer, but some people write functions that push
in the divide loop and then pop
in a second loop to get them in printing order (most-significant first). e.g. Ira's answer on How do I print an integer in Assembly Level Programming without printf from the c library? (My answer on the same question shows the efficient way which is also simpler once you grok it.)
It doesn't particularly matter that the stack grows towards the heap, just that there is some space you can use. And that stack memory is already mapped, and normally hot in cache. This is why we might want to use it.
Stack above heap happens to be true under GNU/Linux for example, which normally puts the main thread's user-space stack near the top of user-space virtual address space. (e.g. 0x7fff...
) Normally there's a stack-growth limit that's much smaller than the distance from stack to heap. You want an accidental infinite recursion to fault early, like after consuming 8MiB of stack space, not drive the system to swapping as it uses gigabytes of stack. Depending on the OS, you can increase the stack limit, e.g. ulimit -s
. And thread stacks are normally allocated with mmap
, the same as other dynamic allocation, so there's no telling where they'll be relative to other dynamic allocation.
AFAIK it's impossible from C, even with inline asm
(Not safely, anyway. An example below shows just how evil you'd have to get to write this in C the way you would in asm. It basically proves that modern C is not a portable assembly language.)
You can't just wrap push
and pop
in GNU C inline asm statements because there's no way to tell the compiler you're modifying the stack pointer. It might try to reference other local variables relative to the stack pointer after your inline asm statement changed it.
Possibly if you knew you could safely force the compiler to create a frame pointer for that function (which it would use for all local-variable access) you could get away with modifying the stack pointer. But if you want to make function calls, many modern ABIs require the stack pointer to be over-aligned before a call. e.g. x86-64 System V requires 16-byte stack alignment before a call
, but push
/pop
work in units of 8 bytes. OTOH, 32-bit ARM (and some 32-bit x86 calling conventions, e.g. Windows) don't have that feature so any number of 4-byte pushes would leave the stack correctly aligned for a function call.
I wouldn't recommend it, though; if you want that level of optimization (and you know how to optimize asm for the target CPU), it's probably safer to write your whole function in asm.
Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference
VLAs aren't resizable. After you do int VLA[n];
you're stuck with that size. Nothing you can do in C will guarantee you more memory that's contiguous with that array.
Same problem with alloca(size)
. It's a special compiler built-in function that (on a "normal" implementation) decrements the stack pointer by size
bytes (rounded to a multiple of the stack width) and returns that pointer. In practice you can make multiple alloca
calls and they will very likely be contiguous, but there's zero guarantee of that so you can't use it safely without UB. Still, you might get away with this on some implementations, at least for now until future optimizations notice the UB and assume that your code can't be reachable.
(And it could break on some calling conventions like x86-64 System V where VLAs are guaranteed to be 16-byte aligned. An 8-byte alloca
there probably rounds up to 16.)
But if you did want to make this work, you'd maybe use long *base_of_stack = alloca(sizeof(long));
(the highest address: stacks grow downward on most but not all ISAs / ABIs - this is another assumption you'd have to make).
Another problem is that there's no way to free alloca
memory except by leaving the function scope. So your pop
has to increment some top_of_stack C pointer variable, not actually moving the real architectural "stack pointer" register. And push
will have to see whether the top_of_stack
is above or below the high-water mark which you also maintain separately. If so you alloca
some more memory.
At that point you might as well alloca
in chunks larger than sizeof(long)
so the normal case is that you don't need to alloc more memory, just move the C variable top-of-stack pointer. e.g. chunks of 128 bytes maybe. This also solves the problem of some ABIs keeping the stack pointer over-aligned. And it lets the stack elements be narrower than the push/pop width without wasting space on padding.
It does mean we end up needing more registers to sort of duplicate the architectural stack pointer (except that the SP never increases on pop).
Notice that this is like std::vector
's push_back
logic, where you have an allocation size and an in-use size. The difference is that std::vector
always copies when it wants more space (because implementations fail to even try to realloc
) so it has to amortize that by growing exponentially. When we know growth is O(1) by just moving the stack pointer, we can use a fixed increment. Like 128 bytes, or maybe half a page would make more sense. We're not touching memory at the bottom of the allocation immediately; I haven't tried compiling this for a target where stack probes are needed to make sure you don't move RSP by more than 1 page without touching intervening pages. MSVC might insert stack probes for this.
Hacked up alloca stack-on-the-callstack: full of UB and miscompiles in practice with gcc/clang
This mostly exists to show how evil it is, and that C is not a portable assembly language. There are things you can do in asm you can't do in C. (Also including efficiently returning multiple values from a function, in different registers, instead of a stupid struct.)
#include <alloca.h>
#include <stdlib.h>
void some_func(char);
// assumptions:
// stack grows down
// alloca is contiguous
// all the UB manages to work like portable assembly language.
// input assumptions: no mismatched { and }
// made up useless algorithm: if('}') total += distance to matching '{'
size_t brace_distance(const char *data)
{
size_t total_distance = 0;
volatile unsigned hidden_from_optimizer = 1;
void *stack_base = alloca(hidden_from_optimizer); // highest address. top == this means empty
// alloca(1) would probably be optimized to just another local var, not necessarily at the bottom of the stack frame. Like char foo[1]
static const int growth_chunk = 128;
size_t *stack_top = stack_base;
size_t *high_water = alloca(growth_chunk);
for (size_t pos = 0; data[pos] != '\0' ; pos++) {
some_func(data[pos]);
if (data[pos] == '{') {
//push_stack(stack, pos);
stack_top--;
if (stack_top < high_water) // UB: optimized away by clang; never allocs more space
high_water = alloca(growth_chunk);
// assert(high_water < stack_top && "stack growth happened somewhere else");
*stack_top = pos;
}
else if(data[pos] == '}')
{
//total_distance += pop_stack(stack);
size_t popped = *stack_top;
stack_top++;
total_distance += pos - popped;
// assert(stack_top <= stack_base)
}
}
return total_distance;
}
Amazingly, this seems to actually compile to asm that looks correct (on Godbolt), with gcc -O1
for x86-64 (but not at higher optimization levels). clang -O1
and gcc -O3
optimize away the if(top<high_water) alloca(128)
pointer compare so this is unusable in practice.
<
pointer comparison of pointers derived from different objects is UB, and it seems even casting to uintptr_t
doesn't make it safe. Or maybe GCC is just optimizing away the alloca(128)
based on the fact that high_water = alloca()
is never dereferenced.
https://godbolt.org/z/ZHULrK shows gcc -O3
output where there's no alloca inside the loop. Fun fact: making volatile int growth_chunk
to hide the constant value from the optimizer makes it not get optimized away. So I'm not sure it's pointer-compare UB that's causing the issue, it's more like accessing memory below the first alloca instead of dereferencing a pointer derived from the second alloca that gets compilers to optimize it away.
# gcc9.2 -O1 -Wall -Wextra
# note that -O1 doesn't include some loop and peephole optimizations, e.g. no xor-zeroing
# but it's still readable, not like -O1 spilling every var to the stack between statements.
brace_distance:
push rbp
mov rbp, rsp # make a stack frame
push r15
push r14
push r13 # save some call-preserved regs for locals
push r12 # that will survive across the function call
push rbx
sub rsp, 24
mov r12, rdi
mov DWORD PTR [rbp-52], 1
mov eax, DWORD PTR [rbp-52]
mov eax, eax
add rax, 23
shr rax, 4
sal rax, 4 # some insane alloca rounding? Why not AND?
sub rsp, rax # alloca(1) moves the stack pointer, RSP, by whatever it rounded up to
lea r13, [rsp+15]
and r13, -16 # stack_base = 16-byte aligned pointer into that allocation.
sub rsp, 144 # alloca(128) reserves 144 bytes? Ok.
lea r14, [rsp+15]
and r14, -16 # and the actual C allocation rounds to %16
movzx edi, BYTE PTR [rdi] # data[0] check before first iteration
test dil, dil
je .L7 # if (empty string) goto return 0
mov ebx, 0 # pos = 0
mov r15d, 0 # total_distance = 0
jmp .L6
.L10:
lea rax, [r13-8] # tmp_top = top-1
cmp rax, r14
jnb .L4 # if(tmp_top < high_water)
sub rsp, 144
lea r14, [rsp+15]
and r14, -16 # high_water = alloca(128) if body
.L4:
mov QWORD PTR [r13-8], rbx # push(pos) - the actual store
mov r13, rax # top = tmp_top completes the --top
# yes this is clunky, hopefully with more optimization gcc would have just done
# sub r13, 8 and used [r13] instead of this RAX tmp
.L5:
add rbx, 1 # loop condition stuff
movzx edi, BYTE PTR [r12+rbx]
test dil, dil
je .L1
.L6: # top of loop body proper, with 8-bit DIL = the non-zero character
movsx edi, dil # unofficial part of the calling convention: sign-extend narrow args
call some_func # some_func(data[pos]
movzx eax, BYTE PTR [r12+rbx] # load data[pos]
cmp al, 123 # compare against braces
je .L10
cmp al, 125
jne .L5 # goto loop condition check if nothing special
# else: it was a '}'
mov rax, QWORD PTR [r13+0]
add r13, 8 # stack_top++ (8 bytes)
add r15, rbx # total += pos
sub r15, rax # total -= popped value
jmp .L5 # goto loop condition.
.L7:
mov r15d, 0
.L1:
mov rax, r15 # return total_distance
lea rsp, [rbp-40] # restore stack pointer to point at saved regs
pop rbx # standard epilogue
pop r12
pop r13
pop r14
pop r15
pop rbp
ret
This is like you'd do for a dynamically allocated stack data structure except:
- it grows downward like the callstack
- we get more memory from
alloca
instead of realloc
. (realloc
can also be efficient if there's free virtual address space after the allocation). C++ chose not to provide a realloc
interface for their allocator, so std::vector
always stupidly allocs + copies when more memory is required. (AFAIK no implementations optimize for the case where new
hasn't been overridden and use a private realloc).
- it's totally unsafe and full of UB, and fails in practice with modern optimizing compilers
- the pages will never get returned to the OS: if you use a large amount of stack space, those pages stay dirty indefinitely.
If you can choose a size that's definitely large enough, you could use a VLA of that size.
I'd recommend starting at the top and going downward, to avoid touching memory far below the currently in-use region of the callstack. That way, on an OS that doesn't need "stack probes" to grow the stack by more than 1 page, you might avoid ever touching memory far below the stack pointer. So the small amount of memory you do end up using in practice might all be within an already mapped page of the callstack, and maybe even cache lines that were already hot if some recent deeper function call already used them.
If you do use the heap, you can minimize realloc costs by doing a pretty large allocation. Unless there was a block on the free-list that you could have gotten with a smaller allocation, in general over-allocating have very low cost if you never touch parts you didn't need, especially if you free or shrink it before doing any more allocations.
i.e. don't memset
it to anything. If you want zeroed memory, use calloc
which may be able to get zeroed pages from the OS for you.
Modern OSes use lazy virtual memory for allocations so the first time you touch a page, it typically has to page-fault and actually get wired into the HW page tables. Also a page of physical memory has to get zeroed to back this virtual page. (Unless the access was a read, then Linux will copy-on-write map the page to a shared physical page of zeros.)
A virtual page you never even touch will just be a larger size in an extent book-keeping data structure in the kernel. (And in the user-space malloc
allocator). This doesn't add anything to the cost of allocating it, or to freeing it, or using the earlier pages that you do touch.