Use vectorcall to let MSVC pass __m128i values in XMM registers, if you pass by value instead of forcing it to memory by using a reference.
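As a rough sketch of the two ways to pass the vector (the function names and the VECCALL macro are made up for this illustration; __vectorcall itself is the real MSVC keyword, and the macro expands to nothing elsewhere so the code still compiles):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_add_epi32, ...
#include <cassert>
#include <cstdint>

// Hypothetical helper macro: __vectorcall on MSVC, nothing on other compilers.
#if defined(_MSC_VER)
#  define VECCALL __vectorcall
#else
#  define VECCALL
#endif

// By value + vectorcall: on MSVC x64 the argument can arrive in XMM0
// and the result goes back in XMM0, with no trip through memory.
__m128i VECCALL add_one_by_value(__m128i x) {
    return _mm_add_epi32(x, _mm_set1_epi32(1));
}

// By reference: if the call is not inlined, the caller has to store the
// vector to memory and pass its address, whatever the convention is.
__m128i VECCALL add_one_by_ref(const __m128i& x) {
    return _mm_add_epi32(x, _mm_set1_epi32(1));
}

// Extract the lowest 32-bit lane so results can be compared as integers.
inline std::int32_t low_lane(__m128i v) { return _mm_cvtsi128_si32(v); }

// Both versions compute the same thing; only the calling sequence differs.
inline void demo_vectorcall() {
    __m128i v = _mm_set1_epi32(41);
    assert(low_lane(add_one_by_value(v)) == low_lane(add_one_by_ref(v)));
}
```

Comparing the MSVC asm for the two functions (with optimization on) should show the difference in how the argument arrives.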
Windows x64's default fastcall convention is bad for small functions. (Small functions that don't inline are bad in general: a function call gets in the way of optimizing the code around the call site, on top of the call/ret overhead itself.)
Your test function is broken because [rsp+60h] in the callee is not the same address as [rsp+60h] in the caller. The call instruction itself pushes an 8-byte return address, so inside the callee RSP is 8 bytes lower and [rsp+60h] is 8 bytes away from a 16-byte boundary. movdqa requires 16-byte alignment, so your load faults. (The ABI requires the stack to be aligned by 16 before a call.)
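The arithmetic can be checked with a small model that treats RSP as a plain integer. The starting value 0x1000 is an arbitrary assumption; any 16-byte-aligned value behaves the same way.

```cpp
#include <cassert>
#include <cstdint>

// call pushes an 8-byte return address, so RSP in the callee is 8 lower.
constexpr std::uint64_t rsp_after_call(std::uint64_t rsp_before) {
    return rsp_before - 8;
}

constexpr bool aligned16(std::uint64_t addr) { return addr % 16 == 0; }

// Hypothetical caller RSP just before the call (the ABI says it is 16-byte aligned).
constexpr std::uint64_t kCallerRsp = 0x1000;

// The caller's [rsp+60h] is a valid movdqa address...
static_assert(aligned16(kCallerRsp + 0x60), "caller slot is aligned");
// ...but the callee's [rsp+60h] is 8 bytes off, so movdqa would fault there.
static_assert(!aligned16(rsp_after_call(kCallerRsp) + 0x60), "callee slot is misaligned");

// The same memory seen from the callee is at [rsp+68h], not [rsp+60h].
inline void demo_stack_offsets() {
    assert(rsp_after_call(kCallerRsp) + 0x68 == kCallerRsp + 0x60);
}
```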
But you shouldn't actually be accessing it relative to rsp at all: it's not passed as a stack-arg per se, but rather by reference using a pointer. When the first arg is an integer/pointer it goes in RCX. That's why you'll see the caller setting up RCX to hold a pointer to that stack space.
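One way to picture this is to write the pointer out explicitly in C++. This is only a sketch of what the default convention does behind your back; AsmTest_lowered and same_bits are names invented for the illustration.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_cmpeq_epi8, ...
#include <cassert>

// What the source says: take and return a vector by value.
__m128i AsmTest(__m128i x) { return x; }

// Roughly what the default Windows x64 convention makes of it: the caller
// stores x somewhere in its own frame and passes that address in RCX.
__m128i AsmTest_lowered(const __m128i* x_in_caller_frame) {
    // Load through the pointer instead of guessing an rsp-relative offset.
    return _mm_loadu_si128(x_in_caller_frame);
}

// True if all 16 bytes of a and b are equal.
inline bool same_bits(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

inline void demo_lowering() {
    __m128i v = _mm_set_epi32(4, 3, 2, 1);
    assert(same_bits(AsmTest(v), AsmTest_lowered(&v)));
}
```

A hand-written asm version would do the same thing as AsmTest_lowered: load from the address in RCX and return the value in XMM0.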
Let MSVC compile __m128i AsmTest(__m128i x){ return x; } with optimization enabled and see where it loads from. https://godbolt.org/z/7pvWqa
movdqu xmm0, XMMWORD PTR [rcx]
ret
It uses movdqu instead of movdqa because MSVC would rather make your code run slow on old CPUs like Core 2 and K8/K10 than fault when you misalign a __m128i. Apparently.
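In intrinsics terms, the two instructions correspond to _mm_loadu_si128 (movdqu) and _mm_load_si128 (movdqa). A minimal sketch of the difference, where the buffer and helper names are assumptions for the example:

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_loadu_si128
#include <cassert>
#include <cstdint>

// 32 bytes holding 0..31; kBytes+0 and kBytes+16 are 16-byte aligned.
alignas(16) static const unsigned char kBytes[32] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

// movdqu-style load: works at any address.
inline __m128i load_any(const unsigned char* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// movdqa-style load: the address must be 16-byte aligned, otherwise the
// instruction faults. The assert catches misuse before that happens.
inline __m128i load_aligned(const unsigned char* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return _mm_load_si128(reinterpret_cast<const __m128i*>(p));
}

// Lowest 32 bits of the vector (little-endian view of the first 4 bytes).
inline std::uint32_t low_dword(__m128i v) {
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(v));
}
```

load_any(kBytes + 4) is fine even though the address is misaligned; load_aligned would only be safe for kBytes or kBytes + 16.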
BTW, learning from compiler output is helpful when you know enough to understand why the compiler is doing what it's doing, and just need to check the details.
You should also look up documentation on the calling convention. See links in https://stackoverflow.com/tags/x86/info.