What implements the stack in a typical process' memory?

Question

I have always been confused about where 'the stack' is implemented. I know about the typical memory layout for a process (at least on Unix-like systems) but I've always wondered what actually sets the structure of that layout. Is it the operating system? The compiler? If it is, then I don't understand how the x86 ISA can have a push instruction; wouldn't this mean that some kind of stack must exist before any OS is even loaded?

The OS sets up the stack of a process, so the process can use `push` and similar instructions immediately. The OS sets up its own stack by not using `push` (and similar) before the stack is set up (initially the OS may rely on the stack set up by the firmware; in that case, replace the term "OS" with "firmware"). Setting up a stack is conceptually easy: just point `r/e/sp` to a free area of memory. The details are a little more involved but irrelevant at this level, in my opinion. — Margaret Bloom, Sep 27 '22 at 20:59
Process creation is the first step in loading a program; the OS is already up and running when it performs process creation. That creates one thread, the main thread, which is given a stack. Once running, the program can request new threads, each of which gets its own stack area in the address space. Processes are isolated from each other and from the kernel memory used by the operating system. Yes, a `push` without a valid value in `rsp` would be bad, so before any `push` can be used, a stack must exist and the stack pointer register must point to its top. That's booting. — Erik Eidt, Sep 27 '22 at 21:15

Peter Cordes · Answer 1 · 2022-09-29T19:21:41.317

In mainstream OSes, the OS maps some stack memory and enters user-space with RSP pointing near the top of that mapping.

But that's a design choice, not a necessity. An OS could require user-space processes to `lea rsp, [rel top_of_some_bss_array]` or whatever early in `_start`, before running any instructions that use the stack. (And before installing any signal handlers or anything else that could asynchronously use RSP.)

(In a protected-mode or 64-bit OS, the kernel normally sets things up so hardware interrupts use a separate kernel stack, not the user-space stack pointer, for security reasons. But if not, or in 16-bit DOS, having a valid stack pointer at all times was important unless you disabled interrupts.)

For example, on Linux: "Analyzing memory mapping of a process with pmap. [stack]" discusses how the initial stack grows, and how that magic only works for the main thread's stack; for new threads in the same process, you (or the thread library) have to manually `mmap` some space. But the thread-creation system call, `clone`, takes an arg for what to set the new thread's stack pointer to. So the API is still designed around every task (thread) having a valid stack pointer before it starts.

Also related: Beginning of stack on Linux - the Linux kernel randomizes the initial stack pointer. It also uses stack space to pass argc, argv, and envp to user-space, along with storing the pointed-to arg and environment strings.
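As a rough way to observe this on Linux, the following C sketch looks up the `[stack]` line in `/proc/self/maps` and checks whether an address falls inside that mapping. The helper names are invented for this illustration, and it assumes it is called from the main thread of a normally exec'd process whose environment array has not been replaced (e.g. by `setenv`).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

extern char **environ;  /* POSIX: pointer array for the process environment */

/* Find the main thread's stack mapping in /proc/self/maps (Linux-only).
 * Returns 0 and stores the bounds on success, -1 otherwise. */
static int find_stack_mapping(uintptr_t *lo, uintptr_t *hi)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    int found = -1;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        if (strstr(line, "[stack]") &&
            sscanf(line, "%lx-%lx", &start, &end) == 2) {
            *lo = (uintptr_t)start;
            *hi = (uintptr_t)end;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Returns 1 if addr lies inside the [stack] mapping, 0 if not, -1 on error. */
static int is_on_main_stack(const void *addr)
{
    uintptr_t lo, hi;
    if (find_stack_mapping(&lo, &hi) != 0)
        return -1;
    uintptr_t a = (uintptr_t)addr;
    return a >= lo && a < hi;
}

/* The envp pointer array that the kernel passed to user-space. */
static int environ_is_on_main_stack(void)
{
    return is_on_main_stack(environ);
}
```

From the main thread, `is_on_main_stack(&some_local)` and `environ_is_on_main_stack()` would both be expected to return 1, while the address of a static variable should give 0. Printing the mapping bounds across several runs should also show them moving, which is the stack randomization mentioned above.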

Instructions can exist that require some setup to use.

For example, at power-on in an x86 CPU, `rsp` holds 0, or at least `esp` on Intel; see comments. (The machine boots in 16-bit unreal mode, so only `ss:sp` is initially relevant.) A `push` would wrap to `ss:FFFE`.
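To make the wraparound concrete: a 16-bit `push` subtracts 2 from SP modulo 2^16. Here is a tiny C model of just that pointer arithmetic (the function name is invented for this sketch; no real stack or segment is involved):

```c
#include <stdint.h>

/* Model of how a 16-bit PUSH updates SP: decrement by the operand size (2),
 * wrapping modulo 2^16. Only the pointer arithmetic is modeled here. */
static uint16_t sp_after_push16(uint16_t sp)
{
    return (uint16_t)(sp - 2u);
}
```

With `sp == 0`, `sp_after_push16(sp)` gives `0xFFFE`, which is why the first push after reset would store at `ss:FFFE`.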

The BIOS code at the reset vector should set `ss:sp` to point somewhere before running any `push` or `pop`, or `call`/`ret`, or enabling interrupts (which will asynchronously use the stack). I assume the system boots with IF=0 because software won't yet have stored an interrupt table and used `lidt`. (This code I'm talking about is in the BIOS itself, before your own code could run via UEFI or as a legacy BIOS MBR bootloader.)

Similarly, the existence of `xlat` doesn't imply that RBX is always a valid pointer. Don't put an `xlat` in your code where it will run when RBX isn't a valid pointer, though!
Or just don't use it, since it's not very fast. The same goes for the `loop` instruction: RCX isn't always a valid loop counter. And again, it's not fast except on recent AMD CPUs; prefer `dec ecx` / `jnz`.

For convenience and efficiency, it works much better to have the OS just provide a stack mapping, instead of requiring the process to allocate its own stack in the BSS, or with an `mmap` system call or something. It's assumed that every process will want a stack, so the OS might as well set that up ahead of time, before entering user-space.

The AMD Programmer's Manual specifically states that most registers, including RSP, have an initial value of 0. Curiously, the Intel version specifically refers to ESP and others having an initial value of 00000000H but doesn't seem to clarify about the upper half. — sj95126, Sep 29 '22 at 13:30
Probably, Intel didn't bother because the initial value of registers is actually in real mode when the computer has just booted. If the firmware puts anything in those 16-bit registers, then I think enabling protected mode doesn't change those low 16 bits and simply extends the registers to 32 bits. It would actually make sense that the manual refers only to the low 16 bits of the registers. Like Peter says, the registers can contain garbage at boot. It isn't really necessary to set them to 0 because the value 0 is not useful. Thus, the firmware/OS has to modify their value anyway. — user123, Sep 29 '22 at 14:57
Maybe AMD added some booting logic that sets all registers to 0 and Intel didn't. After all, x86-64 is a standard. The actual implementation of that standard differs a lot between AMD and Intel. The value in registers at boot is a minor detail. — user123, Sep 29 '22 at 14:59
@user123: 16-bit mode can access 32-bit registers. The `66h` and `67h` prefixes use the other operand size, the one that's not the default for the mode. Only the upper halves of 64-bit registers (and r8-r15) are inaccessible immediately after boot, same as in 32-bit protected mode. But yes, if you only ever write the 16-bit low half, never ESI for example, you can read the upper half of RSI after switching to 64-bit mode. For RSP, this might mean using 16-bit protected mode with 16-bit stack size on the way to long mode, so you can still push/pop. — Peter Cordes, Sep 29 '22 at 18:31
@user123: changing mode does *not* zero-extend registers. That led to Linux kernel data leaks when it didn't zero R8-R11 before returning to 32-bit code from `int 0x80`; 32-bit user-space could far-jmp to a 64-bit code segment and read whatever values compiler-generated code left in those registers, which are call-clobbered in the x86-64 SysV calling convention. ([These days](https://stackoverflow.com/questions/46087730/what-happens-if-int-0x80) the int 0x80 entry point still saves zeros for R8-R11, but saves the full 64-bit values of all other regs into this task's struct pt_regs.) — Peter Cordes, Sep 29 '22 at 18:41
@sj95126: I wonder if Intel's documentation of the initial values of 32-bit registers being `0` is something they consider as a *write* of the 32-bit register, which would implicitly zero-extend into the full 64-bit register (regardless of the mode). I'd assume that in practice all real Intel CPUs *do* zero the full register, rather than building special hardware that lets them leave garbage or an easter-egg in the high halves. — Peter Cordes, Sep 29 '22 at 18:41

score 0 · Answer 2 · answered Sep 29 '22 at 13:32

Think of, say, Windows, Linux, or macOS. The operating system has rules for the binary file formats it supports, and a loader that loads those binaries. Those rules also settle questions such as: does the application need to initialize .data and .bss itself, or will the operating system's loader zero .bss for you based on entries in the binary?

They also define the memory space for a typical application.

Then, when you or someone else builds, for example, a gcc or llvm toolchain for this operating system and target (x86, arm, etc.), the build knows the target operating system and defaults, ideally, to the preferred binary format. The GNU C library will be specific to that operating system, and so on.

So while it may seem that a pre-built Windows GNU toolchain and a pre-built Linux toolchain are the same thing, since both have gcc and other tools, those two builds are quite different with respect to the backend of the C library and how the binaries are built (linker script, etc.).

The stack space is ultimately your responsibility as the programmer, as are the heap, .text, etc. If you are doing bare metal for an MCU, you may or may not see this (a lot of folks use the canned vendor code and, here again, do not see what is going on). But almost everyone uses pre-built toolchains, and even if you build your own, say GNU or LLVM from sources, you will still likely get defaults prepared for you by someone else that match the operating system environment.
