Here is a diagram I created for a call stack for a C++ program on Windows that uses the Windows x64 calling convention. It's more accurate and contemporary than the google image versions:

And corresponding to the exact structure of the above diagram, here is a debug of notepad.exe x64 on windows 7, where the first instruction of a function, 'current function' (because I forgot what function it is), is about to execute.

The low addresses and high addresses are swapped so the stack is climbing upwards in this diagram (it is a vertical flip of the first diagram, also note that the data is formatted to show quadwords and not bytes, so the little endianism cannot be seen). Black is the home space; blue is the return address, which is an offset into the caller function or label in the caller function to the instruction after the call; orange is the alignment; and pink is where rsp
is pointing after the prologue of the function, or rather, before the call is made if you are using alloca. The homespace_for_the_next_function+return_address
value is the smallest allowed frame on windows, and because the 16 byte rsp alignment right at the start of the called function must be maintained, it includes an 8 byte alignment as well, such that rsp
pointing to the first byte after the return address will be aligned to 16 bytes (because rsp
was guaranteed to be aligned to 16 bytes when the function was called and homespace+return_address = 40
, which is not divisible by 16 so you need an extra 8 bytes to ensure the rsp
will be aligned after the function makes a call). Because these functions do not require any stack locals (because they can be optimised into registers) or stack parameters/return values (as they fit in registers) and do not use any of the other fields, the stack frames in green are all alignment+homespace+return_address
in size.
The red function lines outline what the callee function logically 'owns' + reads / modifies by value in the calling convention without needing a reference to it (it can modify a parameter passed on the stack that was too big to pass in a register on -Ofast), and is the classic conception of a stack frame. The green frames demarcate what results from the call and the allocation the called function makes: The first green frame shows what the RtlUserThreadStart
actually allocates in the duration of the function call (from immediately before the call to executing the next call instruction) and goes from the first byte before the return address to the final byte allocated by the function prologue (or more if using alloca). RtlUserThreadStart
allocates the return address itself as null, so you see a sub rsp, 48h
and not sub rsp, 40h
in the prologue, because there is no call to RtlUserThreadStart
, it just begins execution at that rip
at the base of the stack.
Stack space that is needed by the function is assigned in the function prologue by decrementing the stack pointer.
For example, take the following C++, and the MASM it compiles to (-O0
).
typedef struct _struc {int a;} struc, pstruc;
int func(){return 1;}
int square(_struc num) {
int a=1;
int b=2;
int c=3;
return func();
}
_DATA SEGMENT
_DATA ENDS
int func(void) PROC ; func
mov eax, 1
ret 0
int func(void) ENDP ; func
a$ = 32 //4 bytes from rsp+32 to rsp+35
b$ = 36
c$ = 40
num$ = 64
//masm shows stack locals and params relative to the address of rsp; the rsp address
//is the rsp in the main body of the function after the prolog and before the epilog
int square(_struc) PROC ; square
$LN3:
mov DWORD PTR [rsp+8], ecx
sub rsp, 56 ; 00000038H
mov DWORD PTR a$[rsp], 1
mov DWORD PTR b$[rsp], 2
mov DWORD PTR c$[rsp], 3
call int func(void) ; func
add rsp, 56 ; 00000038H
ret 0
int square(_struc) ENDP ; square
As can be seen, 56 bytes are reserved, and the green stack frame will be 64 bytes in size when the call
instruction allocates the 8 byte return address as well.
The 56 bytes consist of 12 bytes of locals, 32 bytes of home space, and 12 bytes of alignment.
All callee register saving and storing register parameters in the home space happens in the prologue before the prologue reserves (using sub rsp, x
instruction) stack space needed by the main body of the function. The alignment is at the highest address of the space reserved by the sub rsp, x
instruction, and the final local variable in the function is assigned at the next lower address after that (and within the assignment for that primitive data type itself it starts at the lowest address of that assignment and works towards the higher addresses, bytewise, because it is little endian), such that the first primitive type (array cell, variable etc.) in the function is at the top of the stack, although the locals can be allocated in any order. This is shown in the following diagram for a different random example code to the above, that does not call any functions (still using x64 Windows cc):

If you remove the call to func()
, it only reserves 24 bytes, i.e. 12 bytes of of locals and 12 bytes of alignment. The alignment is at the start of the frame. When a function pushes something to the stack or reserves space on the stack by decrementing the rsp
, rsp
needs to be aligned, regardless of whether it is going to call another function or not. If the allocation of stack space can be optimised out and no homespace+return_addreess
is required because the function does not make a call, then there will be no alignment requirement as rsp
does not change. It also does not need to align if the stack will be aligned by 16 with just the locals (+ homespace+return_address
if it makes a call) that it needs to allocate, essentially it rounds up the space it needs to allocate to a 16 byte boundary.
rbp
is not used on the x64 Windows calling convention unless alloca
is used.
On gcc 32 bit cdecl and 64 bit system V calling conventions, rbp
is used, and the new rbp
points to the first byte after the old rbp
(only if compiling using -O0
, because it is saved to the stack on -O0
, otherwise, rbp
will point to the first byte after the return address). On these calling conventions, if compiling using -O0
, it will, after callee saved registers, store register parameters to the stack, and this will be relative to rbp
and part of the stack reservation done by the rsp
decrement. Data within the stack reservation done by the rsp
decrement is accessed relative rbp
rather than rsp
, unlike Windows x64 cc. On the Windows x64 calling convention, it stores parameters that were passed to it in registers to the homespace that was assigned for it if it is a varargs function or compiling using -O0
. If it is not a varargs function then on -O1
, it will not write them to the homespace but the homespace will still be provided to it by the calling function, this means that it actually accesses those variables from the register rather from the homespace location on the stack after it stores it there, unlike O0
(which saves them to the homespace and then accesses them through the stack and not the registers).
If a function call is placed in the function represented by the previous diagram, the stack will now look like this before the callee function's prologue starts (Windows x64 cc):

Orange indicates the part that the callee can freely arrange (arrays and structs remain contiguous of course, and work their way towards higher addresses, each element being little endian), so it can put the variables and the return value allocation in any order, and it passes a pointer for the return value allocation in rcx
for the callee to write to when the return type of the function it is calling cannot be passed in rax
. On -O0
, if the return value cannot be passed in rax
, there is also an anonymous variable created (as well as the return value space and as well as any variable it is assigned to, so there can be 3 copies of the struct). -Ofast
cant optimise out the return value space because it is return by value, but it optimises out the anonymous return variable if the return value is not used, or assigns it straight to the variable the return value is being assigned to without creating an anonymous variable, so -Ofast
has 2 / 1 copies and -O0
has 3 / 2 copies (return value assigned to a variable / return value not assigned to a variable). Blue indicates the part the callee must provide in exact order for the calling convention of the callee (the parameters must be in that order, such that the first stack parameter from left to right in the function signature is at the top of the stack, which is the same as how cdecl (which is a 32 bit cc) orders its stack parameters. The alignment for the callee can however be in any location, although I've only ever seen it to be between the locals and callee pushed registers.
If the function calls multiple functions, the call is in the same place on the stack for all the different possible callsites in the function, this is because the prologue caters for the whole function, including all calls it makes, and the parameters and homespace for any called function is always at the end of the allocation made in the prologue.
It turns out that C/C++ Microsoft calling convention only passes a struct in the registers if it fits into one register, otherwise it copies the local / anonymous variable and passes a pointer to it in the first available register. On gcc C/C++, if the struct does not fit in the first 2 parameter registers then it's passed on the stack and a pointer to it is not passed because the callee knows where it is due to the calling convention.
Arrays are passed by reference regardless of their size. So if you need to use rcx
as the pointer to the return value allocation then if the first parameter is an array, the pointer will be passed in rdx
, which will be a pointer to the local variable that is being passed. In this case, it does not need to copy it to the stack as a parameter because it's not passed by value. The pointer however is passed on the stack when passing by reference if there are no registers available to pass the pointer in.