
I'm porting an algorithm of mine to assembly for ml64, half for sport, half to see how much performance I can actually gain.

Anyway, I'm currently trying to understand the stack frame setup. In this example, as far as I know:

push rbp        ; inherited, base pointer of caller, pushed on stack for storage
mov rbp, rsp    ; inherited, base pointer of the callee, moved to rbp for use as base pointer
sub rsp, 32     ; intel guide says each frame must reserve 32 bytes for the storage of the
                ; 4 arguments usually passed through registers
and spl, -16    ; 16 byte alignment?


mov rsp, rbp    ; put your base pointer back in the callee register
pop rbp         ; restore callers base pointer

The two things I'm not getting are:

  1. How does subtracting 32 from RSP do anything at all? As far as I know, other than its duties going from one stack frame to another, it's just another register, right? I suspect it's for going into another stack frame rather than for use in the current one.

  2. What is SPL and why does masking it make something 16 byte aligned?

old_timer
user81993
  • 1) it allocates space (because stack top is pointed to by rsp by definition) 2) spl is the low 8 bits of rsp, so you can use that for aligning although normally you use rsp – Jester May 30 '17 at 18:54
  • You might also want to consider [Why is this C++ code faster than my hand written assembly?](https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat) :-) – Bo Persson May 31 '17 at 07:36

1 Answer

push rbp        ;save non-volatile rbp
mov rbp, rsp    ;save old stack
sub rsp, 32     ;reserve 32 bytes of stack = 8 32-bit integers or 4 pointers.
                ;In the Microsoft x64 convention a caller must provide 32 bytes
                ;of shadow (home) space for every function it calls; a function
                ;that makes no calls can use the area for local variables.
and spl, -16    ;align stack by 16 bytes (for sse code)


mov rsp, rbp    ;restore the old stack
pop rbp         ;restore rbp

How does subtracting 32 from RSP do anything at all?

RSP is the stack pointer, not just another register. Doing anything to it affects the stack. In this case, subtracting 32 moves the top of the stack down by 8x4 = 32 bytes, so that area is reserved: later PUSH and CALL instructions write below it and leave it untouched.
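To make the reservation concrete, here is a minimal sketch in C that models the prologue and epilogue from the question. The `Machine` struct, the helper names and the 256-byte stack are hypothetical and only for illustration; the `and spl, -16` step is left out to keep the arithmetic simple.

```c
#include <stdint.h>

enum { STACK_BYTES = 256 };            /* hypothetical toy stack size */

typedef struct {
    uint64_t rsp;                      /* byte offset of the stack top */
    uint64_t rbp;                      /* frame base */
    uint64_t mem[STACK_BYTES / 8];     /* stack memory in 8-byte slots */
} Machine;

/* push: move the stack top down by 8 bytes, then store there. */
static void push(Machine *m, uint64_t v) {
    m->rsp -= 8;
    m->mem[m->rsp / 8] = v;
}

/* pop: load from the stack top, then move it back up by 8 bytes. */
static uint64_t pop(Machine *m) {
    uint64_t v = m->mem[m->rsp / 8];
    m->rsp += 8;
    return v;
}

/* Models the prologue and returns how many bytes lie between the
   frame base and the new stack top. */
static uint64_t prologue(Machine *m) {
    push(m, m->rbp);                   /* push rbp     */
    m->rbp = m->rsp;                   /* mov rbp, rsp */
    m->rsp -= 32;                      /* sub rsp, 32  */
    return m->rbp - m->rsp;
}

/* Models the epilogue: drop the frame, restore the caller's rbp. */
static void epilogue(Machine *m) {
    m->rsp = m->rbp;                   /* mov rsp, rbp */
    m->rbp = pop(m);                   /* pop rbp      */
}
```

Starting from `rsp = 128`, the prologue leaves `rsp` at 88 with the 32 bytes at offsets 88 to 119 belonging to the frame. Any later `push` stores at offset 80 or lower, which is the whole point of the subtraction.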

What is SPL and why does masking it make something 16 byte aligned?

The and with -16 forces the four least significant bits to zero, which rounds the stack pointer down to a multiple of 16. Because the stack grows down, rounding down only ever reserves a little extra space and never gives any back.
The alignment by 16 bytes is needed when using SSE code, which x64 uses for floating point math. Having 16 byte alignment allows the compiler to use the faster aligned SSE load and store instructions.
SPL is the lower 8 bits of RSP. Why the code masks SPL rather than RSP makes no sense to me. Both instructions are 4 bytes, and and rsp,-16 is strictly better because it does not invoke partial register updates.

Disassembly:

0:  40 80 e4 f0       and    spl,-16   ;bad! partial register update.
4:  48 83 e4 f0       and    rsp,-16   ;good
8:  83 e4 f0          and    esp,-16   ;not usable: zeroes the upper 32 bits of rsp
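The three encodings above can be compared with plain integer arithmetic. This is only a sketch in C of what each instruction does to the 64-bit value; the function names are made up for illustration, and it says nothing about the partial register penalty, which is a hardware matter.

```c
#include <stdint.h>

/* and rsp, -16: clear the low four bits of the full 64-bit value. */
static uint64_t align_via_rsp(uint64_t rsp) {
    return rsp & ~(uint64_t)15;
}

/* and spl, -16: only the low byte is written; bits 8..63 are untouched.
   As an 8-bit immediate, -16 is 0xF0. */
static uint64_t align_via_spl(uint64_t rsp) {
    uint8_t spl = (uint8_t)(rsp & 0xFF);
    spl &= 0xF0;
    return (rsp & ~(uint64_t)0xFF) | spl;
}

/* and esp, -16: a 32-bit write zero-extends into the full register,
   so the upper 32 bits of rsp are lost. */
static uint64_t align_via_esp(uint64_t rsp) {
    uint32_t esp = (uint32_t)rsp & 0xFFFFFFF0u;
    return (uint64_t)esp;
}
```

For a stack pointer such as `0x000000EFFFFFF8C8`, the first two functions both give `0x000000EFFFFFF8C0`, while the third gives `0x00000000FFFFF8C0`, which no longer points into the stack.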

[RSP is] just another register, right?

No, RSP is special.
It points to the top of the stack, which is what the PUSH and POP instructions (and CALL and RET) operate on.
Local variables and any parameters that do not fit into registers are stored on the stack.

Understanding fastcall

On Windows there is essentially only one calling convention in x64 (__vectorcall being the main exception). To make matters more confusing, if you specify a calling convention other than __fastcall, most compilers will silently remap it to that one convention on x64.

Johan