EDIT:

I have accepted an answer below and also added my own with my final revision of the code. Hopefully it shows people actual examples of Shadow Space allocation rather than more words.

EDIT 2: I also managed to find a link to a calling conventions PDF in the Annotations of a YouTube video (of all things) which has some interesting tidbits on Shadow Space and the Red Zone on Linux. It can be found here: http://www.agner.org/optimize/calling_conventions.pdf

ORIGINAL:

I have looked at a couple of other questions here and all over the internet, but I can't seem to find a proper example of allocating "Shadow Space" when calling a subroutine/Windows API in 64-bit Windows assembly.

My understanding is this:

  • Caller should sub rsp,<bytes here> prior to call callee
  • Callee should use it to store registers if need be (or local variables, if register saving isn't required)
  • Caller cleans it up, e.g. add rsp,<bytes here>
  • The amount allocated should be aligned to 32 bytes

With that in mind, this is what I have tried:

section .text

start:

    sub rsp,0x20 ; <---- Allocate 32 bytes of "Shadow space"

    mov rcx,msg1
    mov rdx,msg1.len
    call write

    add rsp,0x20

    mov rcx,NULL
    call ExitProcess

    ret

write:

    mov [rsp+0x08],rcx      ; <-- use the Shadow space
    mov [rsp+0x10],rdx      ; <-- and again

    mov rcx,STD_OUTPUT_HANDLE   ; Get handle to StdOut
    call GetStdHandle

    mov rcx,rax         ; hConsoleOutput
    mov rdx,[rsp+0x08]      ; lpBuffer
    mov r8,[rsp+0x10]       ; nNumberOfCharsToWrite
    mov r9,empty        ; lpNumberOfCharsWritten
    push NULL           ; lpReserved
    call WriteConsoleA

    ret

My two strings are "Hello " and "World!\n". This manages to print "Hello " before crashing. I have a suspicion that I am doing it correctly ... except I should be cleaning up somehow (and I'm not sure how).

What am I doing wrong? I have tried a combination of sizes and also tried "allocating Shadow Space" prior to the WinAPI calls too (am I supposed to be doing that?).

It should be noted that this works perfectly fine when I don't care about Shadow Space at all. However, I am trying to be compliant with the ABI since my write function calls WinAPIs (and is therefore not a leaf function).

Peter Cordes
Simon Whitehead
  • Perhaps [The history of calling conventions, part 5: amd64](http://blogs.msdn.com/b/oldnewthing/archive/2004/01/14/58579.aspx) would be helpful? In particular note the need for the called function to realign the stack, looks like you're not doing that. – Harry Johnston Oct 22 '15 at 07:17
  • Thanks @HarryJohnston. This is on my list of things to read tomorrow morning (it's a bit late now!). I will check back to let you know how I go :) – Simon Whitehead Oct 22 '15 at 11:40
  • In addition to the other issues mentioned, you also forgot to generate unwind data so that the system can walk the stack if an exception occurs. – Raymond Chen Oct 22 '15 at 14:12
  • @RaymondChen Are you able to elaborate? "Unwind data" is a new term for me (by the way: your blog has been very helpful to me over the years :)) – Simon Whitehead Oct 22 '15 at 20:30
  • [This](https://www.tortall.net/projects/yasm/manual/html/objfmt-win64-exception.html) looks like it might be useful. But more generally all of the first page of Google results for "x64 unwind data". :-) – Harry Johnston Oct 22 '15 at 22:47
  • Interesting. I am interpreting that as, Unwind data = making sure you provide shadow space and spill the registers into it. So, assuming I get all of the issues with this question sorted, the "unwind data" portion is essentially taken care of? – Simon Whitehead Oct 23 '15 at 00:13
  • I don't think so - "The identity of the frame pointer register and this offset, which must be a multiple of 16 bytes, is recorded in the unwind data". That page describes a bunch of Yasm primitives to generate unwind data, I guess what you really need is the corresponding page for Nasm. – Harry Johnston Oct 23 '15 at 00:34
  • Indeed. This is ... yet another ... entire piece of functionality I was unaware of :) I have found some NASM samples and the basic gist of it seems to be that your unwind data and exception handler should return your routine to the state that it was in when it was first called. This makes sense, as how does the environment magically know how to restore the stack frame if you've gone and modified it throughout your routine/frame methods? It can't.. you have to tell it. I guess I'll have to look into this more (as it's become its own question). Thanks everyone. – Simon Whitehead Oct 23 '15 at 02:31
  • @RaymondChen Even though I now have a working solution, if you have any resources regarding unwind data and SEH that you recommend I would be interested in reading them. – Simon Whitehead Oct 23 '15 at 13:22
  • You can [read about it on MSDN](https://msdn.microsoft.com/en-us/library/ms235231.aspx) – Raymond Chen Oct 23 '15 at 19:05
  • @RaymondChen: The link in your previous comment is dead, unfortunately. https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170 is the current link for the calling convention in general. https://learn.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170 is "x64 exception handling" in the "x64 ABI conventions" section. – Peter Cordes Jul 19 '22 at 15:21

2 Answers

The shadow space must be provided immediately before the call. Think of the shadow space as a relic of the old stdcall/cdecl conventions: for WriteFile you needed five pushes. The shadow space stands in for the last four pushes (the first four arguments). Now you need four registers, the shadow space (just the space; its contents don't matter) and one value on the stack after the shadow space (which is in fact the first push). Currently the return address to the caller (start) sits in the space that the called WinAPI function will use as its shadow space -> crash.
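
A minimal sketch of that layout, in Python purely for illustration (the helper `arg_location` is hypothetical, and only integer or pointer arguments are modelled). It reports where each argument has to be at the moment of the `call`:

```python
# Sketch of Windows x64 argument placement at the call site, integer/pointer
# arguments only. `arg_location` is a hypothetical helper for illustration.

REGISTERS = ["rcx", "rdx", "r8", "r9"]
SHADOW_BYTES = 32  # four 8-byte home slots, reserved by the caller for every call


def arg_location(index: int) -> str:
    """Return where argument `index` (0-based) lives just before `call`."""
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if index < len(REGISTERS):
        return REGISTERS[index]
    # Stack arguments start directly above the 32 bytes of shadow space.
    offset = SHADOW_BYTES + 8 * (index - len(REGISTERS))
    return f"[rsp+0x{offset:x}]"


if __name__ == "__main__":
    names = ["hConsoleOutput", "lpBuffer", "nNumberOfCharsToWrite",
             "lpNumberOfCharsWritten", "lpReserved"]
    for i, name in enumerate(names):
        print(f"{name:24} -> {arg_location(i)}")
```

The fifth argument comes out at `[rsp+0x20]`. A `push` immediately before the `call` would instead leave RSP pointing at the fifth argument rather than at the bottom of the shadow space.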

You can create a new shadow space for the WinAPI functions (GetStdHandle and WriteConsoleA) inside the function write:

write:
    push rbp
    mov rbp, rsp
    sub rsp, (16 + 32)      ; 5th argument of WriteConsoleA (8) + Shadow space (32)
                            ; + 8 bytes of padding so the total is a multiple of 16 (the `push rbp` above already re-aligned RSP after function entry)

    mov [rbp+16],rcx        ; <-- use our Shadow space, provided by `start`
    mov [rbp+24],rdx        ; <-- and again, to save our incoming args

    mov rcx, -11            ; Get handle to StdOut
    call GetStdHandle

    mov rcx,rax             ; hConsoleOutput
    mov rdx, [rbp+16]       ; lpBuffer        ; reloaded saved copy of register arg
    mov r8, [rbp+24]        ; nNumberOfCharsToWrite
    mov r9,empty            ; lpNumberOfCharsWritten
    mov qword [rsp+32],0    ; lpReserved - 5th argument directly behind the shadow space
    call WriteConsoleA

    leave
    ret
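
The `[rbp+16]` and `[rbp+24]` offsets above follow from simple arithmetic. As a rough sketch in Python (the helper `home_slot_offset` is hypothetical, not part of the answer), the caller's home slot for a register argument sits above the return address plus whatever the callee has pushed or reserved since entry:

```python
# Sketch: byte offset from the callee's current stack pointer (or frame
# pointer) up to the home slot the caller reserved for a register argument.
# `home_slot_offset` is a hypothetical helper, not something from the answer.

RETURN_ADDRESS_BYTES = 8  # pushed by `call`


def home_slot_offset(index: int, pushed_bytes: int = 0, reserved_bytes: int = 0) -> int:
    """Offset of home slot `index` (0 = rcx, 1 = rdx, 2 = r8, 3 = r9).

    pushed_bytes:   bytes the callee has pushed since entry (8 for `push rbp`)
    reserved_bytes: bytes the callee has subtracted from rsp since entry
    """
    if not 0 <= index < 4:
        raise ValueError("only the first four arguments have home slots")
    return reserved_bytes + pushed_bytes + RETURN_ADDRESS_BYTES + 8 * index


if __name__ == "__main__":
    # Relative to rbp after `push rbp` / `mov rbp, rsp`:
    print(home_slot_offset(0, pushed_bytes=8), home_slot_offset(1, pushed_bytes=8))
```

With `pushed_bytes=8` the results are 16 and 24, matching `[rbp+16]` and `[rbp+24]`. With no frame at all (the question's code at function entry) they are 8 and 16, matching `[rsp+0x08]` and `[rsp+0x10]`.
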
Peter Cordes
rkhb
  • This confuses me even more now because you've stated that the Shadow Space should be provided "directly previous to the call" (that is as I understand it also). However, your example sets up the local stack as per usual within the function (directly _after_ the call). So which one is it? (I don't mean to sound rude - I'm still just uncertain of what you mean) – Simon Whitehead Oct 22 '15 at 11:44
  • Oh sorry - I see now. The shadow space in your example is set up for the calls to the WinAPIs. You're saying my example above is correct I just need to add Shadow space for the calls to the WinAPI. Is that correct? – Simon Whitehead Oct 22 '15 at 11:47
  • @SimonWhitehead: Yes. BTW: I stumbled over `add rsp,0x28`, which doesn't match `sub rsp,0x20`. This doesn't matter here, but you could get in trouble in the future. – rkhb Oct 22 '15 at 11:55
  • Right.. I think I understand now. The shadow space cascades downward through method calls. I should set up some shadow space in main and then in the write method I should reserve another 32 byte shadow space for the WinAPI calls and can use the shadow space in main via an offset from `rbp`. That makes sense to me. I will try it tomorrow! Thanks so much! – Simon Whitehead Oct 22 '15 at 12:27
  • RE: 32 vs 40 - yeah sorry this was a mistake I caught. Thanks! :) – Simon Whitehead Oct 22 '15 at 12:28
  • @SimonWhitehead: I made a mistake. The fifth parameter cannot be pushed, because the stack pointer has to point to the shadow space (and not to the fifth parameter). Look at my changed answer. – rkhb Oct 22 '15 at 13:14
  • Ah yes thanks @rkhb. That makes sense (your update). I will try this on my lunch break today :) – Simon Whitehead Oct 22 '15 at 21:16
  • Fantastic @rkhb. This worked perfectly! I feel like I have a much better understanding of the material on x64 calling conventions .. except now I need to understand what Raymond said in his comment above. Again, thanks so much! :) – Simon Whitehead Oct 22 '15 at 22:00
  • `sub rsp, (8 + 32)` after one `push` will misalign the stack for `call GetStdHandle`, I think. It would be appropriate if you were going to do another `push` of an arg instead of `mov`. – Peter Cordes Jun 17 '20 at 20:21
  • @PeterCordes Yea, `call` will push RIP into the stack (+ 8 bytes), which will make it unaligned. So instead of `sub rsp, (16 + 32)` he needs to `sub rsp, (16 + 32 + 8)`. Or, maybe, `sub rsp, (8 + 32)` will do. I'm not sure myself... Looking at compiler listings, it seems that it does the former, at least with `/Od`. But maybe the latter can be used safely when making custom ASM? – ScienceDiscoverer Aug 07 '22 at 11:11
  • @ScienceDiscoverer: If you want to call any other functions following the Windows x64 ABI / calling convention, you need to get back to RSP % 16 == 0 before the call. If your custom asm function doesn't make any calls, then RSP alignment doesn't matter. – Peter Cordes Aug 07 '22 at 11:50

For completeness, I am posting this here as this is what I have ended up with. This works perfectly and as far as I can see, barring the UNWIND_INFO/Exception Handling requirements of x64 ASM on Windows, this is pretty much spot on. The comments are hopefully accurate too.

EDIT:

This is now updated after Raymond's comment below. I removed the preservation of rbp because it wasn't required and threw my stack alignment out further than I intended.

; Windows APIs

; GetStdHandle
; ------------
; HANDLE WINAPI GetStdHandle(
;     _In_ DWORD nStdHandle
; ); 
extern GetStdHandle

; WriteFile
; ------------
; BOOL WINAPI WriteFile(
;   _In_        HANDLE       hFile,
;   _In_        LPCVOID      lpBuffer,
;   _In_        DWORD        nNumberOfBytesToWrite,
;   _Out_opt_   LPDWORD      lpNumberOfBytesWritten,
;   _Inout_opt_ LPOVERLAPPED lpOverlapped
; );
extern WriteFile

; ExitProcess
; -----------
; VOID WINAPI ExitProcess(
;     _In_ UINT uExitCode
; );
extern ExitProcess

global start

section .data

    STD_OUTPUT_HANDLE   equ -11
    NULL                equ 0

    msg1                 db "Hello ", 0
    msg1.len             equ $-msg1

    msg2                 db "World!", 10, 0
    msg2.len             equ $-msg2

section .bss

empty               resd 1

section .text

start:

    sub rsp,0x28    ; Allocate 32 bytes of Shadow Space + align it to 16 bytes (8 byte return address already on stack, so 8 + 40 = 16*3)

    mov rcx,msg1
    mov rdx,msg1.len
    call write

    mov rcx,msg2
    mov rdx,msg2.len
    call write

    mov rcx,NULL
    call ExitProcess

    add rsp,0x28    ; Restore the stack pointer before exiting

    ret

write:

    ; Allocate another 40 bytes of stack space (the return address makes 48 total). It's 32
    ; bytes of Shadow Space for the WinAPI calls + 8 more bytes for the fifth argument
    ; to the WriteFile API call.
    sub rsp,0x28

    mov [rsp+0x30],rcx      ; Argument 1 is 48 bytes back in the stack (40 for Shadow Space above, 8 for return address)
    mov [rsp+0x38],rdx      ; Argument 2 is just after Argument 1

    mov rcx,STD_OUTPUT_HANDLE   ; Get handle to StdOut
    call GetStdHandle

    mov rcx,rax             ; hFile
    mov rdx,[rsp+0x30]      ; lpBuffer
    mov r8,[rsp+0x38]       ; nNumberOfBytesToWrite
    mov r9,empty            ; lpNumberOfBytesWritten

    ; Move the 5th argument directly behind the Shadow Space
    mov qword [rsp+0x20],0  ; lpOverlapped, Argument 5 (just after the Shadow Space 32 bytes back)
    call WriteFile

    add rsp,0x28        ; Restore the stack pointer (remove the Shadow Space)

    ret

Which results in... (screenshot: "Finally working!")
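
The `sub rsp,0x28` values used in `start` and `write` can be double-checked with a little arithmetic. The following is only a sketch in Python, assuming the function is entered with RSP % 16 == 8 (that is, the caller was aligned before its `call`); the helper name `frame_bytes` is hypothetical:

```python
# Sketch: smallest `sub rsp, N` that provides shadow space plus stack
# arguments and leaves RSP 16-byte aligned for the calls a function makes.
# Assumes RSP % 16 == 8 on entry; `frame_bytes` is a hypothetical helper.

SHADOW_BYTES = 32  # home slots for rcx, rdx, r8, r9


def frame_bytes(stack_args: int = 0, local_bytes: int = 0, pushes: int = 0) -> int:
    """Return N for `sub rsp, N`.

    stack_args:  largest number of stack-passed arguments of any callee
    local_bytes: extra local storage wanted above the outgoing arguments
    pushes:      registers pushed in the prologue before the `sub`
    """
    needed = SHADOW_BYTES + 8 * stack_args + local_bytes
    # The return address (8 bytes) plus each push (8 bytes) skews RSP on entry.
    entry_skew = (8 + 8 * pushes) % 16
    # Pad so that (N + entry_skew) is a multiple of 16.
    padding = -(needed + entry_skew) % 16
    return needed + padding


if __name__ == "__main__":
    print(hex(frame_bytes()))                        # start: shadow space only
    print(hex(frame_bytes(stack_args=1)))            # write: one stack argument
    print(hex(frame_bytes(stack_args=1, pushes=1)))  # write with `push rbp`
```

Under those assumptions the three cases give 0x28, 0x28 and 0x30: the first two match the code above, and the last matches the `sub rsp, (16 + 32)` needed when `rbp` is pushed first.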

Simon Whitehead
  • Your `write` prologue does not establish 16-byte stack alignment. – Raymond Chen Oct 23 '15 at 19:06
  • @RaymondChen Again.. can you elaborate? I am trying to learn so if it's off somehow I would like to understand why. I've thought this over a bit and as far as I understand it, it's correct - so if I'm wrong I have a big misunderstanding. I figure, the 32 bytes of Shadow Space + 8 bytes for the fifth argument to `WriteFile` makes 40. The return address being on the stack makes 48 - so does adding 40 bytes not align it? *confused* – Simon Whitehead Oct 24 '15 at 00:40
  • Having thought about it all day... it appears that I have forgotten completely about the `push rbp` in the prologue... which is yet another 8 bytes and is happening before the `sub rsp,0x28`. So really this bumps the stack out to 56 bytes which is not a multiple of 16. Therefore I should probably make it `sub rsp,0x30` to push it to 64 bytes. – Simon Whitehead Oct 24 '15 at 08:57
  • If the above is correct.. I could avoid storing `rbp` altogether (which is an optimization a compiler would make I guess). – Simon Whitehead Oct 24 '15 at 13:50
  • Your options include adjusting the `sub rsp` to account for the memory used by `push rbp`; or you could remove the `push rbp` entirely and then revise the code which uses `rbp` to access the parameters so that it uses some other mechanism. – Raymond Chen Oct 24 '15 at 14:39
  • Fantastic! So my last comment was correct. Thanks so much @RaymondChen. – Simon Whitehead Oct 24 '15 at 22:33
  • Basically it all seems to be based on the idea that using `push` complicates things by making the relationship between the SP and the stack frame different in different parts of the code, so should be avoided. That's probably oversimplistic, though. :-) – Harry Johnston Oct 27 '15 at 03:13
  • Thanks @HarryJohnston. I have managed to find quite a few more documents describing the implementation but none that actually demonstrate it. This was my original problem. I now understand it perfectly though and although the concept itself still seems strange as a requirement, it's no longer quite the head scratcher it was. Thanks for pointing me to some information about it all :) – Simon Whitehead Oct 27 '15 at 11:15
  • I've never seen anyone use a symbolic `NULL` in asm. That's super-weird, and stops you from seeing that you can and should just `xor ecx, ecx` to zero it. 2 bytes instead of 5 or 7 bytes (depending on whether your assembler optimizes `mov rcx, 0` into `mov ecx, 0`), [and has other advantages](http://stackoverflow.com/questions/33666617/which-is-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and). Also, you could tail-call optimize the `call/add/ret` at the end of both functions into `add/jmp`. – Peter Cordes Feb 25 '16 at 09:49
  • The param spilling looks correct, though. `write()` for the SysV ABI would probably `push` / `pop` rbx and rbp, and use them to preserve the function arguments across the `call`. Or spill the args directly, after reserving space on the stack. `push` is only 1 byte, and is just as fast as a `mov` on modern CPUs with a stack-engine, as long as you place it carefully to avoid extra synchronization uops. IIRC, I've seen clang push/pop a dummy register to obey the ABI stack alignment restriction for a `call`. – Peter Cordes Feb 25 '16 at 10:04
  • @PeterCordes I decided on a symbolic `NULL` just so it was visibly similar to the WinAPI documentation I was reading. If it's weird, it's because I am incredibly inexperienced in the ASM space (especially on the Windows side) so I am sorry about that! The tail-call optimizations are interesting - I didn't consider those either. Thanks for the comments! – Simon Whitehead Feb 26 '16 at 01:09
  • @SimonWhitehead: You don't need to apologize for inexperience. I can see the motivation for a symbolic NULL, but `mov reg, 0` instead of `xor` is a pet peeve of mine. A lot of people don't realize how much difference there is. IMO, knowing asm is mostly useful for reading compiler output to see if it did a good job, or to see what a microbenchmark is actually testing. Also for debugging / profiling. And there's no point writing asm unless you optimize the crap out of it, so I like to give suggestions in that direction. :) Glad I could help you out. – Peter Cordes Feb 26 '16 at 01:20