How to set function arguments in assembly during runtime in a 64bit application on Windows?

Question

I am trying to use assembly code to set the arguments of a generic function. The function resides in a DLL, and its arguments are not known at compile time. At runtime, the pointer to the function is obtained with GetProcAddress, and I can determine the arguments (both value and type) from a data file (not a header file or anything that can be included or compiled).

I have found a good example of how to solve this problem for 32-bit (C Pass arguments as void-pointer-list to imported function from LoadLibrary()), but that example does not work for 64-bit, because you cannot just fill the stack; you have to fill the registers. So I tried to use assembly code to fill the registers, but so far without success. I call the assembly code from C, using VS2015 and MASM (64-bit). The C code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.

C code:

...
void fill_register_xmm0(double); // proto of assembly function
...
// code determining the pointer to a func returned by the GetProcAddress()
...
double dVal = 12.0;
int v;

fill_register_xmm0(dVal);
v = func->func_i(); // integer function that will use the dVal
...

assembly code in different .asm file (MASM syntax):

TITLE fill_register_xmm0

.code
option prologue:none ; turn off default prologue creation
option epilogue:none ; turn off default epilogue creation
fill_register_xmm0 PROC variable: REAL8  ; REAL8=equivalent to double or float64

movsd  xmm0, variable  ; fill value of variable into xmm0

ret

fill_register_xmm0 ENDP
option prologue:PrologueDef ; turn on default prologue creation
option epilogue:EpilogueDef ; turn on default epilogue creation

END

You need to call `func->func_i` directly from the asm code if you set up the registers in asm. It's unclear which types and how many arguments you use. Anyway, the first 4 args are passed in registers (*xmm0-3* for double or *rcx,rdx,r8,r9* for integer types), the others begin at `[rsp+20h]`, and the stack must be 16-byte aligned before the call instruction. — RbMm, Mar 14 '18 at 17:17
The C compiler uses the registers. It can't possibly work reliably to make a bunch of function calls like `fill_register_xmm0` and hope that the compiler doesn't clobber any of those registers. And BTW, `movsd xmm0, variable` probably assembles to `movsd xmm0, xmm0`, because the first function arg is passed in XMM0 if it's FP. — Peter Cordes, Mar 15 '18 at 04:26

score 2 · Answer 1 · answered Mar 20 '18 at 02:14

The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.

You definitely need to make the function call from asm. It can't possibly work reliably to make a bunch of function calls like `fill_register_xmm0` and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use the registers as part of its normal job, including when it passes args to functions like `fill_register_xmm0`.

The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if() blocks might be good.
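A minimal sketch of that alternative, assuming the data file can only describe a handful of signatures. The names `signature`, `arg_value`, `generic_fn`, `call_by_signature` and the target `scale` are made up for illustration; `generic_fn` stands in for the pointer returned by GetProcAddress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of signatures the data file is allowed to describe. */
typedef enum { SIG_INT_VOID, SIG_INT_DOUBLE, SIG_INT_INT_DOUBLE } signature;

/* One argument value; which member is valid depends on the signature. */
typedef union { int i; double d; } arg_value;

/* Stand-in for the untyped pointer returned by GetProcAddress. */
typedef void (*generic_fn)(void);

/* Cast the generic pointer back to its real type and let the compiler
   emit an ordinary call, so it handles registers and stack itself. */
static int call_by_signature(generic_fn fn, signature sig, const arg_value *args)
{
    assert(fn != NULL);
    switch (sig) {
    case SIG_INT_VOID:
        return ((int (*)(void))fn)();
    case SIG_INT_DOUBLE:
        return ((int (*)(double))fn)(args[0].d);
    case SIG_INT_INT_DOUBLE:
        return ((int (*)(int, double))fn)(args[0].i, args[1].d);
    }
    return -1;  /* unknown signature */
}

/* Example target standing in for a DLL export. */
static int scale(int n, double factor) { return (int)(n * factor); }
```

This only stays manageable while the number of signatures is small, since every combination needs its own case.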

And BTW, `movsd xmm0, variable` probably assembles to `movsd xmm0, xmm0`, because the first function arg is passed in XMM0 if it's FP.

In C, prepare a buffer with the args (like in the 32-bit case).

Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)

Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
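The slot rules above can be sketched in C roughly like this. The helper names `passed_by_value` and `store_arg` are hypothetical, and zeroing the unused high bytes of a slot is a choice made here for predictability, not something the calling convention requires. The by-reference branch stores a pointer to the caller's object directly; a real caller might want to pass a pointer to a copy instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sizes 1, 2, 4 and 8 go by value in one slot; everything else by reference. */
static int passed_by_value(size_t size)
{
    return size == 1 || size == 2 || size == 4 || size == 8;
}

/* Write one argument into an 8-byte slot.  By-value args are copied into the
   low bytes (x86-64 is little-endian) with the rest zeroed; any other size
   stores a pointer to the object instead. */
static void store_arg(uint64_t *slot, const void *arg, size_t size)
{
    assert(slot != NULL && arg != NULL && size > 0);
    *slot = 0;
    if (passed_by_value(size)) {
        memcpy(slot, arg, size);
    } else {
        memcpy(slot, &arg, sizeof arg);  /* the pointer itself fills the slot */
    }
}
```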

But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.

If the 4th arg is integer, it goes in R9, even if it's the first integer arg. This is unlike the x86-64 System V calling convention, where the first integer arg goes in rdi regardless of how many earlier FP args are in registers and/or on the stack.
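To make the difference concrete, here is a small illustrative model of register selection in the two conventions. It only covers plain scalar integer and FP args, and the function names `win64_arg_register` and `sysv_arg_register` are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Windows x64: the register is chosen by argument position alone.
   Returns NULL for the 5th and later args, which go on the stack. */
static const char *win64_arg_register(size_t position, int is_fp)
{
    static const char *const int_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const fp_regs[4]  = { "xmm0", "xmm1", "xmm2", "xmm3" };
    if (position >= 4)
        return NULL;
    return is_fp ? fp_regs[position] : int_regs[position];
}

/* x86-64 System V, for contrast: integer and FP args are counted separately.
   is_fp[i] is nonzero if arg i is FP.  Returns NULL for stack args. */
static const char *sysv_arg_register(const int *is_fp, size_t position)
{
    static const char *const int_regs[6] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
    static const char *const fp_regs[8]  = { "xmm0", "xmm1", "xmm2", "xmm3",
                                             "xmm4", "xmm5", "xmm6", "xmm7" };
    size_t earlier_same_class = 0;
    assert(is_fp != NULL);
    for (size_t i = 0; i < position; i++)
        if (!is_fp[i] == !is_fp[position])
            earlier_same_class++;
    if (is_fp[position])
        return earlier_same_class < 8 ? fp_regs[earlier_same_class] : NULL;
    return earlier_same_class < 6 ? int_regs[earlier_same_class] : NULL;
}
```

For a function taking `(double, double, double, int)`, the model puts the `int` in r9 on Windows but in rdi on System V.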

So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
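As a plain-C simulation of that idea (not real register access), the sketch below copies the same four 8-byte slots into both a pretend integer register file and a pretend XMM register file without looking at types. `arg_registers` and `load_both` are hypothetical names, and the model assumes `double` is 8 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated register state: what the wrapper sets up before the call. */
typedef struct {
    uint64_t gpr[4];   /* rcx, rdx, r8, r9 */
    double   xmm[4];   /* low 64 bits of xmm0..xmm3 */
} arg_registers;

/* Copy the first four 8-byte slots into both sets, ignoring types. */
static arg_registers load_both(const uint64_t slots[4])
{
    arg_registers regs;
    assert(slots != NULL);
    memcpy(regs.gpr, slots, sizeof regs.gpr);
    memcpy(regs.xmm, slots, sizeof regs.xmm);
    return regs;
}
```

The callee then reads whichever copy matches its own parameter type and ignores the other one.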

The C side that puts all the args into a buffer can look like this:

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// asm wrapper: argbuf must be 16-byte aligned, arg_bytes is 0..256
int asmwrapper(const void *argbuf, size_t arg_bytes, void (*funcpointer)(void));

void somefunc() {
    alignas(16) uint64_t argbuf[256/8];  // or char argbuf[256].  But if you choose not to use alignas, then uint64_t will still give 8-byte alignment

    char *argp = (char*)argbuf;
    char *const argend = argp + sizeof(argbuf);   // one past the end of the buffer
    for( ; argp < argend ; argp += 8) {           // one 8-byte slot per arg
        if (figure_out_an_arg()) {
            int foo = get_int_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else if(bar) {
            double foo = get_double_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else
           ... memcpy whatever size
           // or allocate space to pass by ref and memcpy a pointer
           // (break out of the loop once there are no more args)
    }
    if (argp == argend) {
        // error, ran out of space for args
    }

    asmwrapper(argbuf, argp - (char*)argbuf, funcpointer);
}

Unfortunately I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. That would require nothing valuable to be stored below argbuf, so that we could just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf by reserving some space for use by the asm), but we have no way of stopping the compiler from putting something there.

Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.

argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. Reading past the end of it can't be a problem: it can't be at the end of a page with unmapped memory after it, because our parent function's stack frame definitely takes some space.

;; NASM syntax.  For MASM just rename the local labels and add whatever PROC / ENDP is needed.
;; UNTESTED

   ;; rcx: argbuf
   ;; rdx: length in bytes of the args.  0..256, zero-extended to 64 bits
   ;; r8 : function pointer

   ;; reserve rdx bytes of space for arg passing
   ;; load first 32 bytes of argbuf into integer and FP arg-passing registers
   ;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
    push  rbp
    mov   rbp, rsp    ; so we can efficiently restore the stack later

    mov   r10, r8     ; move function pointer to a volatile but non-arg-passing register

    ; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
    ; regardless of types or whether there were that many arg bytes
    ; All bytes are loaded into registers early, some reg->reg transfers are done later
    ; when we're done with more registers.

   ; movsd    xmm0, [rcx]
   ; movsd    xmm1, [rcx+8]

    movaps   xmm0, [rcx]    ; 16-byte alignment required for argbuf.  Use movups to allow misalignment if you want
    movhlps  xmm1, xmm0     ; use some ALU instructions instead of just loads
    ; rcx,rdx can't be set yet, still in use for wrapper args

    movaps   xmm2, [rcx+16]   ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
    ;movhlps  xmm3, xmm2      ; the copyloop uses xmm3: do this later
    movq     r8, xmm2
    mov      r9, [rcx+24]

    mov      eax, 32
    cmp      edx, eax
    jbe    .small_args      ; no copying needed, just shadow space

    sub   rsp, rdx
    and   rsp, -16     ; reserve extra space, realigning the stack by 16

    ; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
 .copyloop:                   ; do {
    movaps   xmm3, [rcx+rax]
    movaps   [rsp+rax], xmm3   ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
    add      eax, 16
    cmp      eax, edx
    jb    .copyloop            ; } while(bytes_copied < arg_bytes);

  .done_arg_copying:

    ; xmm0,xmm1 have the first 2 qwords of args
    movq     rcx, xmm0       ; RCX NO LONGER POINTS AT argbuf
    movq     rdx, xmm1

    ; xmm2 still has the 2nd 16 bytes of args
    ;movhlps  xmm3, xmm2       ; don't use: false dependency on old value and we just used it.
    pshufd   xmm3, xmm2, 0xee  ; xmm3 = high 64 bits of xmm2.  (0xee = _MM_SHUFFLE(3,2,3,2))
    ; movq   xmm3, r9          ; nah, can be multiple uops on AMD
    ; r8,r9 set earlier

    call  r10
    leave             ; restore RSP to its value on entry
    ret

; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
    sub  rsp, rax      ; reserve shadow space
    ;rsp still aligned by 16 after push rbp
    jmp  .done_arg_copying

;byte count.  This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of  movq  rcx, xmm0
;mov eax, $-asmwrapper
align 16

This does assemble (on Godbolt with NASM), but I haven't tested it.
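Since the wrapper is untested, its stack arithmetic can at least be checked on its own with a small C model. This is only a sketch of the `sub rsp` / `and rsp, -16` and copy-loop logic above; the names `SHADOW_BYTES`, `reserved_stack_bytes` and `copied_stack_bytes` are made up, and it assumes RSP is 16-byte aligned after `push rbp`, as the asm comments say.

```c
#include <assert.h>
#include <stddef.h>

enum { SHADOW_BYTES = 32 };

/* Bytes the wrapper reserves below the 16-byte aligned stack pointer:
   just the shadow space for small arg lists, otherwise arg_bytes rounded
   up to a multiple of 16 (the effect of `sub rsp, rdx` / `and rsp, -16`). */
static size_t reserved_stack_bytes(size_t arg_bytes)
{
    assert(arg_bytes <= 256);
    if (arg_bytes <= SHADOW_BYTES)
        return SHADOW_BYTES;
    return (arg_bytes + 15) & ~(size_t)15;
}

/* Bytes the copy loop moves from argbuf to the stack, in 16-byte chunks,
   starting just above the shadow space. */
static size_t copied_stack_bytes(size_t arg_bytes)
{
    size_t copied = 0;
    for (size_t off = SHADOW_BYTES; off < arg_bytes; off += 16)
        copied += 16;
    return copied;
}
```

In this model the copy never reaches past the reserved area, because the last 16-byte chunk ends exactly at the rounded-up size.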

It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp/cmovb in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)

If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)

If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.

Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.

If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.

You have done quite a bit of work here! I think the OP will appreciate it. — Paul Ogilvie, Mar 20 '18 at 09:06

score 1 · Answer 2 · answered Mar 14 '18 at 17:21

So you need to call a function (in a DLL) but only at run-time can you figure out the number and type of parameters. Then you need to prepare the parameters, either on the stack or in registers, depending on the Application Binary Interface/calling convention.

I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...

You then pass this list to a function to prepare the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters onto the stack. For a register-based ABI, it can prepare the register values and save them as local variables (sub sp,nnn) and, once all parameters have been prepared (possibly using registers needed for the call, hence first saving them), it loads the registers (a series of mov instructions) and performs the call instruction.
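One possible shape for such a {type, value} list, sketched in C. The type and function names (`arg_type`, `typed_arg`, `flatten_args`) are hypothetical, and the flattening into consecutive 8-byte slots assumes the Windows x64 layout where every argument occupies one slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { ARG_INT64, ARG_DOUBLE, ARG_POINTER } arg_type;

/* One {type, value} entry as produced by whatever parses the data file. */
typedef struct {
    arg_type type;
    union { int64_t i; double d; void *p; } value;
} typed_arg;

/* Flatten the list into consecutive 8-byte slots.  Returns the number of
   bytes written; 0 means the list was empty or did not fit. */
static size_t flatten_args(const typed_arg *args, size_t count,
                           uint64_t *slots, size_t slot_count)
{
    assert(args != NULL && slots != NULL);
    if (count > slot_count)
        return 0;
    for (size_t n = 0; n < count; n++) {
        slots[n] = 0;
        switch (args[n].type) {
        case ARG_INT64:   memcpy(&slots[n], &args[n].value.i, sizeof(int64_t)); break;
        case ARG_DOUBLE:  memcpy(&slots[n], &args[n].value.d, sizeof(double));  break;
        case ARG_POINTER: memcpy(&slots[n], &args[n].value.p, sizeof(void *));  break;
        }
    }
    return count * sizeof slots[0];
}
```

The type tag is only needed while packing; the assembler function that receives the flat buffer can treat every slot the same way.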

You don't actually need type info for the Windows x86-64 calling convention. You can prepare a flat buffer of args and pass it to a wrapper function, which loads the first 32 bytes into both XMM and integer registers and copies the rest onto the stack. I eventually finished writing an answer for this with a hopefully-working asm wrapper. — Peter Cordes, Mar 20 '18 at 02:17
