
My understanding of the memory layout under C is that a program's memory is split between the stack and the heap, each growing from opposite ends of the address space and conceivably able to allocate all of RAM, though obviously this is abstracted behind some kind of OS memory manager.
The stack is designed for handling local variables (automatic storage) and the heap for memory allocation (dynamic storage).

(Editor's note: there are C implementations where automatic storage doesn't use a "call stack", but this question assumes a normal modern C implementation on a normal CPU where locals do use the callstack if they can't just live in registers.)


Say I want to implement a stack data structure for some data parsing algorithm. Its lifetime and scope is limited to one function.

I can think of 3 ways to do such a thing, yet none of them seems to me the cleanest way to go about this given the scenario.

My first thought is to construct a stack on the heap, like C++ std::vector:

Some algorithm(Some data)
{
  Label *stack = new_stack(stack_size_estimate(data));
  Iterator i = some_iterator(data);
  while(i)
  {
    Label label = some_label(some_iterator_at(i));
    if (label_type_a(label))
    {
      push_stack(stack,label);
    }
    else if(label_type_b(label))
    {
      some_process(&data,label,pop_stack(stack));
    }
    i = some_iterator_next(i);
  }
  some_stack_cleanup(&data,stack);
  delete_stack(stack);
  return data;
}

This method is alright but it's wasteful in that the stack size is a guess and at any moment push_stack could call some internal malloc or realloc and cause irregular slowdowns. None of which are problems for this algorithm, but this construct seems better suited for applications in which a stack has to be maintained across multiple contexts. That isn't the case here; the stack is private to this function and is deleted before exit, just like automatic storage class.


My next thought is recursion. Because recursion uses the built-in call stack, this seems closer to what I want.

Some algorithm(Some data)
{
  Iterator i = some_iterator(data);
  return some_extra(algorithm_helper(extra_from_some(data), &i));
}
Extra algorithm_helper(Extra thing, Iterator* i)
{
  if(!*i)
  {return thing;}
  {
    Label label = some_label(some_iterator_at(*i));
    if (label_type_a(label))
    {
      *i = some_iterator_next(*i);
      return algorithm_helper
      (  extra_process( algorithm_helper(thing,i), label),  i  );
    }
    else if(label_type_b(label))
    {
      *i = some_iterator_next(*i);
      return extra_attach(thing,label);
    }
  }
}

This method saves me from writing and maintaining a stack. The code, to me, seems harder to follow, not that it matters to me.

My main issue with it is that it uses far more space.
Each stack frame holds a copy of this Extra construct (which basically contains the Some data plus the actual bits I want held on the stack), plus an unnecessary copy of the exact same iterator pointer in every frame, because passing it is "safer" than referencing some static global (and I don't know how to avoid doing it this way). This wouldn't be a problem if the compiler did some clever tail-call optimization, but I don't like crossing my fingers and hoping my compiler is awesome.


The third way I can think of involves some kind of dynamic array that can grow on the stack, as the last thing written there, using some C feature I don't know about.
Or an extern asm block.

Thinking about it, that's what I'm looking for, but I don't see myself writing an asm version unless it's dead simple, and I don't see that being easier to write or maintain despite it seeming simpler in my head. And obviously it wouldn't be portable across ISAs.

I don't know if I'm overlooking some feature or if I need to find another language or if I should rethink my life choices. All could be true... I hope it's just the first one.

I'm not opposed to using some library. Is there one, and if so how does it work? I didn't find anything in my searches.


I recently learned about variable-length arrays and I don't really understand why they couldn't be leveraged as a way to grow a stack reference, though I also can't imagine them working that way.

Peter Cordes
haelmic
  • I confess I'm not clear what your concerns are. I'd go with a dynamically allocated stack (that might be the first or third variant) that resizes when necessary (make a guess at how big a stack size you would normally need; allocate enough stack space for that, or maybe twice that size; grow later if necessary). Implement something simple; measure whether performance is actually an issue. When you know where the bottleneck is in the simple solution, you'll have information about how best to relieve the bottleneck. I would not attempt an inline stack; I'd use functions, possibly `inline` ones. – Jonathan Leffler Jan 02 '20 at 07:31
  • If you don't know how big your stack needs to be, using VLA (variable length array) technology is unlikely to help. – Jonathan Leffler Jan 02 '20 at 07:34

3 Answers


tl;dr: use std::vector or an equivalent.


(Edited)

Regarding your opening statement: The days of segments are over. These days processes have multiple stacks (one for each thread), but all share one heap.

Regarding option 1: Instead of writing and maintaining a stack, and guessing its size, you should just literally use std::vector, or a C wrapper around it, or a C clone of it - in any case, use the 'vector' data structure.

Vector's growth algorithm is generally quite efficient. Not perfect, but generally good for many, many real-world use cases.

Regarding option 2: You are right, at least as long as the discussion is confined to C. In C, recursion is both wasteful and non-scalable. In some other languages, notably in functional languages, recursion is the way to express these algorithms, and tail-call optimization is part of the language definition.

Regarding option 3: The closest to that C thing you're looking for is alloca(). It allows you to grow the stack frame, and if the stack doesn't have enough memory, the OS will allocate it. However, it's going to be quite difficult to build a stack object around it, since there's no realloca(), as pointed out by @Peter Cordes.

The other drawback is that stacks are still limited. On Linux, the stack is typically limited to 8 MB. This is the same scalability limitation as with recursion.
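On POSIX systems you can query that limit at run time with getrlimit(). A minimal sketch (POSIX-only; the function name is made up for illustration):

```c
#include <stdio.h>
#include <sys/resource.h>   // POSIX: getrlimit, RLIMIT_STACK

// Print and return the soft limit on the main thread's stack size,
// or 0 on error / when the limit is unlimited.
unsigned long long stack_soft_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return 0;
    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("stack limit: unlimited\n");
        return 0;
    }
    printf("stack limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    return (unsigned long long)rl.rlim_cur;
}
```

On a default Linux setup this typically reports 8 MiB; the limit can be raised with ulimit -s or setrlimit().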

Regarding variable-length arrays: VLAs are basically syntactic sugar, a notational convenience. Beyond syntax, they have essentially the same capabilities as ordinary arrays (with a few extra restrictions, and sizeof is evaluated at run time rather than compile time), and none of the dynamic power of std::vector.
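A short illustration of both points (vla_bytes is a made-up name): the size of a VLA is chosen at run time, but frozen for the array's whole lifetime, and sizeof on it is computed at run time:

```c
#include <stddef.h>

// n must be >= 1 (a zero-length array is undefined behavior).
size_t vla_bytes(size_t n)
{
    int vla[n];          // size chosen at run time...
    return sizeof vla;   // ...and sizeof is evaluated at run time:
                         // n * sizeof(int). The size can never change.
}
```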

root
  • Yes to most of this, but no, `alloca()` doesn't let you grow an existing `alloca` allocation. There is no `alloca` version of `realloc`. One of the obstacles is that stacks grow downward on most systems. I posted [an answer](https://stackoverflow.com/questions/59559587/using-the-callstack-to-implement-a-stack-data-structure-in-c/59561598#59561598) with a C implementation that *heavily* abuses `alloca` and UB to keep growing a stack data structure downward (toward lower addresses), mostly to show how evil it would be to do in C what is fairly "natural" in asm. – Peter Cordes Jan 02 '20 at 10:39
  • And BTW, if you want a really efficient `std::vector` in C, write your own that can use `realloc`. Don't actually use C++ std::vector; it *always* copies without trying to extend the allocation in-place, because C++ allocators are dumb and don't support a realloc interface [Why is there no reallocation functionality in C++ allocators?](//stackoverflow.com/q/3105001). (Virtual memory means it's pretty common for there to be free pages after a large allocation, allowing growth without copying.) – Peter Cordes Jan 03 '20 at 09:52

Not really a true answer, but a bit too long for a mere comment.

In fact, the image of the stack and the heap growing towards each other is over-simplistic. It used to be true with the 8086 processor series (at least in some memory models), where the stack and the heap shared one single segment of memory, but even the old Windows 3.1 system came with API functions allowing allocation of memory outside the heap (search for GlobalAlloc as opposed to LocalAlloc), provided the processor was at least an 80286.

But modern systems all use virtual memory. With virtual memory there is no longer a nice consecutive segment shared by the heap and the stack, and the OS can hand out memory wherever it wants (provided, of course, it can find free memory somewhere). But the stack still has to be a contiguous region. Because of that, the size of that region is determined at build time and is fixed, while the size of the heap is only constrained by the maximum memory the system can allocate to the process. That is the reason why many recommend using the stack only for small data structures and always allocating large ones dynamically. Furthermore, I know of no portable way for a program to learn its stack size, let alone its remaining free stack space.

So IMHO, what is important here is to ask whether the size of your stack is small enough. If it can exceed a small limit, just go for dynamically allocated memory, because it will be both easier and more robust. You can (and should) test for allocation errors, but a stack overflow is always fatal.
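That asymmetry is the point: a failed heap allocation is a value you can check and handle, while blowing the stack gives you no such chance. A minimal sketch (make_buffer is a made-up name):

```c
#include <stdio.h>
#include <stdlib.h>

// Allocate n ints on the heap, reporting failure instead of crashing.
int *make_buffer(size_t n)
{
    int *p = malloc(n * sizeof *p);
    if (p == NULL) {                          // recoverable: report and bail
        fprintf(stderr, "out of memory\n");
        return NULL;                          // caller decides what to do
    }
    return p;
}
```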

Finally, my advice is not to try to use the system stack for your own dedicated usage, even if limited to one single function, unless you can cleanly ask for a memory array on the stack and build your own stack management over it. Using assembly language to directly use the underlying stack will add a lot of complexity (not to speak of lost portability) for a hypothetical minimal gain. Just don't. Even if you want to use assembly language instructions for a low-level optimization of your stack, my advice is to use a dedicated memory region and leave the system stack to the compiler.

My experience says that the more complexity you put into your code, the harder it will be to maintain, and the less robust.

So just follow best practices and only use low level optimizations when and where you cannot avoid them.

Serge Ballesta

In practice, if you can't set a hard upper bound on possible size of less than 1kiB or so, you should normally just dynamically allocate. If you can be sure the size is that small, you could consider using alloca as the container for your stack.

(You can't conditionally use a VLA, it has to be in-scope. Although you could have its size be zero by declaring it after an if(), and set a pointer variable to the VLA address, or to malloc. But alloca would be easier.)
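A sketch of that stack-or-heap choice (scratch_demo and the 256-element threshold are made up; the VLA gets a dummy size of 1 when unused, since a zero-length VLA would be UB):

```c
#include <stdlib.h>

// Use a small on-stack VLA when n is small, else fall back to malloc.
// Returns the last scratch value written, or -1 on allocation failure.
int scratch_demo(size_t n)
{
    int use_stack = (n >= 1 && n <= 256);
    int vla[use_stack ? n : 1];                        // size fixed here
    int *buf = use_stack ? vla : malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)                     // use as scratch space
        buf[i] = (int)i;
    int last = (n > 0) ? buf[n - 1] : 0;

    if (!use_stack)                                    // only free the heap case
        free(buf);
    return last;
}
```

As the answer notes, alloca would avoid the in-scope-declaration contortion, at the cost of being nonstandard.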

In C++ you'd normally use std::vector, but it's dumb because it can't / doesn't use realloc (Does std::vector *have* to move objects when growing capacity? Or, can allocators "reallocate"?). So in C++ it's a tradeoff between more efficient growth vs. reinventing the wheel, although it's still amortized O(1) time. You can mitigate most of it with a fairly large reserve() up front, because memory you alloc but never touch usually doesn't cost anything.

In C you have to write your own stack anyway, and realloc is available. (And all C types are trivially copyable, so there's nothing stopping you from using realloc). So when you do need to grow, you can realloc the storage. But if you can't set a reasonable and definitely-large-enough upper bound on function entry and might need to grow, then you should still track capacity vs. in-use size separately, like std::vector. Don't call realloc on every push/pop.
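A minimal sketch of such a stack in C (all names made up for illustration): capacity and in-use size are tracked separately, and realloc runs only when the buffer is actually full, so growth stays amortized O(1):

```c
#include <stdlib.h>

typedef struct {
    long  *data;
    size_t size, capacity;   // in-use count vs. allocated slots
} Stack;

int stack_init(Stack *s, size_t initial_capacity)
{
    s->size = 0;
    s->capacity = initial_capacity ? initial_capacity : 16;
    s->data = malloc(s->capacity * sizeof *s->data);
    return s->data ? 0 : -1;
}

int stack_push(Stack *s, long v)
{
    if (s->size == s->capacity) {                  // grow only when full
        size_t newcap = s->capacity * 2;           // exponential growth
        long *p = realloc(s->data, newcap * sizeof *p);
        if (p == NULL)                             // old block still valid
            return -1;
        s->data = p;
        s->capacity = newcap;
    }
    s->data[s->size++] = v;
    return 0;
}

long stack_pop(Stack *s)  { return s->data[--s->size]; } // caller checks non-empty
int  stack_empty(const Stack *s) { return s->size == 0; }
void stack_free(Stack *s) { free(s->data); s->data = NULL; }
```

Unlike std::vector, realloc here can often extend the allocation in place instead of copying, which is the advantage the paragraph above describes.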


Using the callstack directly as a stack data structure is easy in pure assembly language (for ISAs and ABIs that use a callstack, i.e. "normal" CPUs like x86, ARM, MIPS, etc). And yes, in asm it's worth doing for stack data structures that you know will be very small, and not worth the overhead of malloc / free.

Use asm push or pop instructions (or an equivalent sequence for ISAs without single-instruction push / pop). You can even check the size / see if the stack data structure is empty by comparing against a saved stack-pointer value. (Or just maintain an integer counter alongside your push/pops).

A very simple example is the inefficient way some people write int->string functions. For non-power-of-2 bases like 10, you generate digits in least-significant first order by dividing by 10 to remove them one at a time, with digit = remainder. You could just store into a buffer and decrement a pointer, but some people write functions that push in the divide loop and then pop in a second loop to get them in printing order (most-significant first). e.g. Ira's answer on How do I print an integer in Assembly Level Programming without printf from the c library? (My answer on the same question shows the efficient way which is also simpler once you grok it.)
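The efficient way can be sketched in C (utoa_dec is a made-up name, not the code from those answers): store digits backward from the end of a buffer, so the decrementing pointer leaves them in printing order with no second pop loop:

```c
// Convert an unsigned value to decimal. Digits are generated
// least-significant first by repeated division by 10, and stored at
// decreasing addresses, so the finished string reads most-significant
// first. Returns a pointer to the first digit (NUL-terminated).
char *utoa_dec(unsigned long v, char *buf_end)
{
    char *p = buf_end;
    *--p = '\0';
    do {
        *--p = '0' + (char)(v % 10);   // peel off the low digit
        v /= 10;
    } while (v);
    return p;
}
```

Usage: char buf[32]; char *s = utoa_dec(12345, buf + sizeof buf); now s points at "12345" inside buf.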

It doesn't particularly matter that the stack grows towards the heap, just that there is some space you can use. And that stack memory is already mapped, and normally hot in cache. This is why we might want to use it.

Stack above heap happens to be true under GNU/Linux for example, which normally puts the main thread's user-space stack near the top of user-space virtual address space. (e.g. 0x7fff...) Normally there's a stack-growth limit that's much smaller than the distance from stack to heap. You want an accidental infinite recursion to fault early, like after consuming 8MiB of stack space, not drive the system to swapping as it uses gigabytes of stack. Depending on the OS, you can increase the stack limit, e.g. ulimit -s. And thread stacks are normally allocated with mmap, the same as other dynamic allocation, so there's no telling where they'll be relative to other dynamic allocation.


AFAIK it's impossible from C, even with inline asm

(Not safely, anyway. An example below shows just how evil you'd have to get to write this in C the way you would in asm. It basically proves that modern C is not a portable assembly language.)

You can't just wrap push and pop in GNU C inline asm statements because there's no way to tell the compiler you're modifying the stack pointer. It might try to reference other local variables relative to the stack pointer after your inline asm statement changed it.

Possibly if you knew you could safely force the compiler to create a frame pointer for that function (which it would use for all local-variable access) you could get away with modifying the stack pointer. But if you want to make function calls, many modern ABIs require the stack pointer to be over-aligned before a call. e.g. x86-64 System V requires 16-byte stack alignment before a call, but push/pop work in units of 8 bytes. OTOH, 32-bit ARM (and some 32-bit x86 calling conventions, e.g. Windows) don't have that feature so any number of 4-byte pushes would leave the stack correctly aligned for a function call.

I wouldn't recommend it, though; if you want that level of optimization (and you know how to optimize asm for the target CPU), it's probably safer to write your whole function in asm.


Variable Length Arrays and I don't really understand why they couldn't be leveraged as a way to grow stack reference

VLAs aren't resizable. After you do int VLA[n]; you're stuck with that size. Nothing you can do in C will guarantee you more memory that's contiguous with that array.

Same problem with alloca(size). It's a special compiler built-in function that (on a "normal" implementation) decrements the stack pointer by size bytes (rounded to a multiple of the stack width) and returns that pointer. In practice you can make multiple alloca calls and they will very likely be contiguous, but there's zero guarantee of that so you can't use it safely without UB. Still, you might get away with this on some implementations, at least for now until future optimizations notice the UB and assume that your code can't be reachable.

(And it could break on some calling conventions like x86-64 System V where VLAs are guaranteed to be 16-byte aligned. An 8-byte alloca there probably rounds up to 16.)

But if you did want to make this work, you'd maybe use long *base_of_stack = alloca(sizeof(long)); (the highest address: stacks grow downward on most but not all ISAs / ABIs - this is another assumption you'd have to make).

Another problem is that there's no way to free alloca memory except by leaving the function scope. So your pop has to increment some top_of_stack C pointer variable, not actually moving the real architectural "stack pointer" register. And push will have to see whether the top_of_stack is above or below the high-water mark which you also maintain separately. If so you alloca some more memory.

At that point you might as well alloca in chunks larger than sizeof(long) so the normal case is that you don't need to alloc more memory, just move the C variable top-of-stack pointer. e.g. chunks of 128 bytes maybe. This also solves the problem of some ABIs keeping the stack pointer over-aligned. And it lets the stack elements be narrower than the push/pop width without wasting space on padding.

It does mean we end up needing more registers to sort of duplicate the architectural stack pointer (except that the SP never increases on pop).

Notice that this is like std::vector's push_back logic, where you have an allocation size and an in-use size. The difference is that std::vector always copies when it wants more space (because implementations fail to even try to realloc) so it has to amortize that by growing exponentially. When we know growth is O(1) by just moving the stack pointer, we can use a fixed increment. Like 128 bytes, or maybe half a page would make more sense. We're not touching memory at the bottom of the allocation immediately; I haven't tried compiling this for a target where stack probes are needed to make sure you don't move RSP by more than 1 page without touching intervening pages. MSVC might insert stack probes for this.

Hacked up alloca stack-on-the-callstack: full of UB and miscompiles in practice with gcc/clang

This mostly exists to show how evil it is, and that C is not a portable assembly language. There are things you can do in asm you can't do in C. (Also including efficiently returning multiple values from a function, in different registers, instead of a stupid struct.)

#include <alloca.h>
#include <stdlib.h>

void some_func(char);

// assumptions:
//   stack grows down
//   alloca is contiguous
//   all the UB manages to work like portable assembly language.

// input assumptions: no mismatched { and }

// made up useless algorithm: if('}') total += distance to matching '{'
size_t brace_distance(const char *data)
{
  size_t total_distance = 0;
  volatile unsigned hidden_from_optimizer = 1;
  void *stack_base = alloca(hidden_from_optimizer);      // highest address. top == this means empty
             // alloca(1) would probably be optimized to just another local var, not necessarily at the bottom of the stack frame.  Like  char foo[1]
  static const int growth_chunk = 128;
  size_t *stack_top = stack_base;
  size_t *high_water = alloca(growth_chunk);

  for (size_t pos = 0; data[pos] != '\0' ; pos++) {
    some_func(data[pos]);
    if (data[pos] == '{') {
        //push_stack(stack, pos);
        stack_top--;
        if (stack_top < high_water)      // UB: optimized away by clang; never allocs more space
            high_water = alloca(growth_chunk);
        // assert(high_water < stack_top && "stack growth happened somewhere else");
        *stack_top = pos;
    }
    else if(data[pos] == '}')
    {
        //total_distance += pop_stack(stack);
        size_t popped = *stack_top;
        stack_top++;
        total_distance += pos - popped;
        // assert(stack_top <= stack_base)
    }
  }

  return total_distance;
}

Amazingly, this seems to actually compile to asm that looks correct (on Godbolt), with gcc -O1 for x86-64 (but not at higher optimization levels). clang -O1 and gcc -O3 optimize away the if(top<high_water) alloca(128) pointer compare so this is unusable in practice.

< pointer comparison of pointers derived from different objects is UB, and it seems even casting to uintptr_t doesn't make it safe. Or maybe GCC is just optimizing away the alloca(128) based on the fact that high_water = alloca() is never dereferenced. https://godbolt.org/z/ZHULrK shows gcc -O3 output where there's no alloca inside the loop. Fun fact: making volatile int growth_chunk to hide the constant value from the optimizer makes it not get optimized away. So I'm not sure it's pointer-compare UB that's causing the issue, it's more like accessing memory below the first alloca instead of dereferencing a pointer derived from the second alloca that gets compilers to optimize it away.

# gcc9.2 -O1 -Wall -Wextra
# note that -O1 doesn't include some loop and peephole optimizations, e.g. no xor-zeroing
# but it's still readable, not like -O1 spilling every var to the stack between statements.

brace_distance:
        push    rbp
        mov     rbp, rsp      # make a stack frame
        push    r15
        push    r14
        push    r13           # save some call-preserved regs for locals
        push    r12           # that will survive across the function call
        push    rbx
        sub     rsp, 24
        mov     r12, rdi
        mov     DWORD PTR [rbp-52], 1
        mov     eax, DWORD PTR [rbp-52]
        mov     eax, eax
        add     rax, 23
        shr     rax, 4
        sal     rax, 4              # some insane alloca rounding?  Why not AND?
        sub     rsp, rax            # alloca(1) moves the stack pointer, RSP, by whatever it rounded up to
        lea     r13, [rsp+15]
        and     r13, -16            # stack_base = 16-byte aligned pointer into that allocation.
        sub     rsp, 144            # alloca(128) reserves 144 bytes?  Ok.
        lea     r14, [rsp+15]
        and     r14, -16            # and the actual C allocation rounds to %16

        movzx   edi, BYTE PTR [rdi]  # data[0] check before first iteration
        test    dil, dil
        je      .L7                  # if (empty string) goto return 0

        mov     ebx, 0               # pos = 0
        mov     r15d, 0              # total_distance = 0
        jmp     .L6
.L10:
        lea     rax, [r13-8]         # tmp_top = top-1
        cmp     rax, r14
        jnb     .L4                  # if(tmp_top < high_water)
        sub     rsp, 144
        lea     r14, [rsp+15]
        and     r14, -16             # high_water = alloca(128) if body
.L4:
        mov     QWORD PTR [r13-8], rbx   # push(pos) - the actual store
        mov     r13, rax             # top = tmp_top completes the --top
            # yes this is clunky, hopefully with more optimization gcc would have just done
            # sub r13, 8  and used [r13] instead of this RAX tmp
.L5:
        add     rbx, 1      # loop condition stuff
        movzx   edi, BYTE PTR [r12+rbx]
        test    dil, dil
        je      .L1
.L6:                        # top of loop body proper, with 8-bit DIL = the non-zero character
        movsx   edi, dil               # unofficial part of the calling convention: sign-extend narrow args
        call    some_func              # some_func(data[pos])
        movzx   eax, BYTE PTR [r12+rbx]   # load data[pos]
        cmp     al, 123                   # compare against braces
        je      .L10
        cmp     al, 125
        jne     .L5                    # goto loop condition check if nothing special
                         # else: it was a '}'
        mov     rax, QWORD PTR [r13+0]
        add     r13, 8                  # stack_top++ (8 bytes)
        add     r15, rbx                # total += pos
        sub     r15, rax                # total -= popped value
        jmp     .L5                     # goto loop condition.
.L7:
        mov     r15d, 0
.L1:
        mov     rax, r15                # return total_distance
        lea     rsp, [rbp-40]           # restore stack pointer to point at saved regs
        pop     rbx                     # standard epilogue
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret

This is like you'd do for a dynamically allocated stack data structure except:

  • it grows downward like the callstack
  • we get more memory from alloca instead of realloc. (realloc can also be efficient if there's free virtual address space after the allocation). C++ chose not to provide a realloc interface for their allocator, so std::vector always stupidly allocs + copies when more memory is required. (AFAIK no implementations optimize for the case where new hasn't been overridden and use a private realloc).
  • it's totally unsafe and full of UB, and fails in practice with modern optimizing compilers
  • the pages will never get returned to the OS: if you use a large amount of stack space, those pages stay dirty indefinitely.

If you can choose a size that's definitely large enough, you could use a VLA of that size.

I'd recommend starting at the top and going downward, to avoid touching memory far below the currently in-use region of the callstack. That way, on an OS that doesn't need "stack probes" to grow the stack by more than 1 page, you might avoid ever touching memory far below the stack pointer. So the small amount of memory you do end up using in practice might all be within an already mapped page of the callstack, and maybe even cache lines that were already hot if some recent deeper function call already used them.
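That top-down usage can be sketched with a plain fixed-size array (names and the 256-word bound are made up; pick a bound you can prove sufficient):

```c
#include <stddef.h>

#define SCRATCH_WORDS 256   // assumed upper bound on stack depth

// Push all values onto a downward-growing scratch stack, then pop and
// sum them, mimicking how the callstack would be used in asm: the
// in-use region stays at the high end of the array, near memory that
// is likely already mapped and hot.
size_t sum_of_popped(const long *values, size_t n)
{
    long scratch[SCRATCH_WORDS];
    long *top = scratch + SCRATCH_WORDS;   // empty: top is one-past-the-end

    for (size_t i = 0; i < n && top > scratch; i++)
        *--top = values[i];                // push: decrement, then store

    size_t total = 0;
    while (top < scratch + SCRATCH_WORDS)
        total += (size_t)*top++;           // pop: load, then increment
    return total;
}
```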


If you do use the heap, you can minimize realloc costs by doing a pretty large allocation. Unless there was a block on the free-list that you could have gotten with a smaller allocation, in general over-allocating has very low cost if you never touch parts you didn't need, especially if you free or shrink it before doing any more allocations.

i.e. don't memset it to anything. If you want zeroed memory, use calloc which may be able to get zeroed pages from the OS for you.

Modern OSes use lazy virtual memory for allocations so the first time you touch a page, it typically has to page-fault and actually get wired into the HW page tables. Also a page of physical memory has to get zeroed to back this virtual page. (Unless the access was a read, then Linux will copy-on-write map the page to a shared physical page of zeros.)

A virtual page you never even touch will just be a larger size in an extent book-keeping data structure in the kernel. (And in the user-space malloc allocator). This doesn't add anything to the cost of allocating it, or to freeing it, or using the earlier pages that you do touch.

Peter Cordes