Optimizing a C function call using 64-bit MASM

Question

I'm currently using this 64-bit MASM code to call a C runtime function such as memcmp(). I recall this convention came from a GoAsm article on optimizations.

              memcmp          PROTO;:QWORD,:QWORD,:QWORD
              PUSH            RSP
              PUSH            QWORD PTR [RSP]
              AND             SPL,0F0h
              MOV             R8,R11
              MOV             RDX,R10
              MOV             RCX,RAX
              SUB             RSP,32
              CALL            memcmp
              LEA             RSP,[RSP+40]
              POP             RSP

Is this a valid optimized version below?

              memcmp          PROTO;:QWORD,:QWORD,:QWORD
              PUSH            RSP
              PUSH            QWORD PTR [RSP]
              AND             RSP,-16        ; new
              MOV             R8,R11
              MOV             RDX,R10
              MOV             RCX,RAX
              LEA             RSP,[RSP-32]   ; new
              CALL            memcmp
              LEA             RSP,[RSP+40]
              POP             RSP

The justification for replacing

              AND             SPL,0F0h

with

              AND             RSP,-16

is that it avoids partial-register updates (see: Understanding fastcall stack frame).

The justification for replacing

              SUB             RSP,32

with

              LEA             RSP,[RSP-32]

is that, because the ensuing instructions do not depend on the flags set by the subtraction, not updating the flags should be more efficient as well (see: Why does GCC emit "lea" instead of "sub" for subtraction?).

In this case, are there other optimization tricks too?

`AND` yes, the original code was silly and not saving any code-size (SPL takes a REX prefix). `LEA` - pointless and a waste of code-size: x86 CPUs already avoid false dependencies on FLAGS via register renaming. The answer on that linked Q&A is wrong. — Peter Cordes, Aug 14 '21 at 23:43
@PeterCordes _Side note:_ Thanks for seeing this. I was going to link a comment under one of your responses to alert you to this but decided to wait a bit because I figured you'd find it based on your normal M/O. — Craig Estey, Aug 14 '21 at 23:47

Peter Cordes · Accepted Answer · 2021-08-15T00:08:57.590

AND: yes. The original code was silly and didn't save any code size (SPL takes a REX prefix too, just like 64-bit operand-size).

LEA: pointless and a waste of code size. x86 CPUs already avoid false dependencies on FLAGS via register renaming; that's necessary to efficiently run normal x86 code, which is full of instructions like add, sub, and, etc. Compilers would use lea much more heavily if that weren't the case. The answer on that linked Q&A is wrong and should be downvoted / deleted. The only danger is on a few less-common CPUs (Pentium 4 and Silvermont, for different reasons) from instructions like inc that only write some flags (see: INC instruction vs ADD 1: Does it matter?). Even the cost of inc on Silvermont-family is pretty minor: just an extra uop, but not during decode, so it doesn't stall.

add is not slower than lea on any CPUs, either itself or in its influence on later instructions. (Except in-order Atom pre-Silvermont, where lea ran earlier in the pipeline than add (on an actual AGU), so it could be better or worse depending on where data was coming from / going to). You'd only use lea in some cases like an adc loop where you actually need to keep CF unchanged so next iteration can read it. i.e. to not mess up a true dependency (RAW), nothing to do with avoiding a false (WAW) output dependency. (See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs - note that cases where adc / inc / adc creates a partial-flag stall are cases where add would cause a correctness problem, so I'm not counting that as a case where add would make later instructions faster.)
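As a rough illustration of the one case mentioned above where lea is genuinely needed, here is a minimal sketch of an adc loop. The procedure name add_bignum and its register interface are hypothetical, chosen only for this example. The pointer increments use lea because add would overwrite CF, which the next adc has to read.

```asm
.code
; Hypothetical leaf function: dst[i] += src[i] with carry propagation.
; Assumed interface: RCX = dst, RDX = src, R8 = number of qwords (nonzero).
; Returns the final carry-out in EAX.
add_bignum PROC
        clc                         ; start with CF = 0
next_limb:
        mov     rax, [rdx]
        adc     [rcx], rax          ; reads CF from the previous iteration, writes a new CF
        lea     rcx, [rcx+8]        ; advance pointers without touching FLAGS
        lea     rdx, [rdx+8]
        dec     r8                  ; DEC writes ZF but leaves CF unchanged
        jnz     next_limb
        setc    al                  ; CF still holds the carry from the last adc
        movzx   eax, al
        ret
add_bignum ENDP
END
```

Here lea protects a true dependency (the next adc needs CF). That is a different situation from the SUB RSP,32 in the question, where nothing reads the flags afterwards.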

You probably don't need to save the old RSP; the ABI requires 16-byte stack alignment before a call, and that includes your caller (unless you're getting called from code that doesn't follow the ABI, so you don't have known RSP alignment relative to a 16-byte boundary).

Normally you'd just do sub rsp, 40 like a compiler would, to realign RSP and reserve space for the shadow space. (And you'd do this at the top/bottom of the function, not around every call, along with saving/restoring call-preserved registers).
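A minimal sketch of that pattern, assuming the caller follows the Windows x64 ABI. The wrapper name buffers_equal is hypothetical, and unwind directives are left out for brevity. Because the wrapper's own arguments already arrive in RCX, RDX, and R8, no register shuffling is needed before the call.

```asm
EXTERN memcmp:PROC

.code
; Hypothetical wrapper: returns 1 in EAX if the two buffers match, else 0.
; Assumed interface: RCX = first buffer, RDX = second buffer, R8 = byte count.
buffers_equal PROC
        sub     rsp, 40             ; 32 bytes of shadow space + 8 to realign RSP to 16
        call    memcmp              ; args are already in RCX, RDX, R8
        xor     ecx, ecx
        test    eax, eax            ; memcmp returns 0 when the buffers are equal
        sete    cl
        mov     eax, ecx
        add     rsp, 40             ; plain add is fine: nothing reads FLAGS afterwards
        ret
buffers_equal ENDP
END
```

On entry RSP is 8 bytes off a 16-byte boundary because the caller's call pushed a return address, so subtracting 40 both reserves the shadow space and restores alignment in one instruction.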

(In practice memcmp is unlikely to care about stack alignment, unless it needs to save/restore some more XMM regs. The Windows x64 calling convention unwisely only has 6 call-clobbered x/ymm registers, and that might be slightly tight depending on how much loop unrolling they do in a hand-written(?) memcmp.)

And even if you did need to handle an unknown incoming RSP alignment, saving RSP to two different locations for pop rsp is still not a very efficient way to go about it. Normally you'd just use RBP to make a traditional frame pointer to clean up with mov rsp, rbp / pop rbp, which works regardless of unknown adjustment to RSP. e.g. even in functions that use alloca (or in asm, that do an unknown number of pushes or variable-sized sub rsp, which is effectively the same thing as and rsp, -16).
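For the unknown-alignment case, the frame-pointer approach might look like the following sketch. The procedure name compare_unaligned is hypothetical, and the code assumes RCX, RDX, and R8 already hold the memcmp arguments on entry.

```asm
EXTERN memcmp:PROC

.code
; Hypothetical entry point that may be reached with any RSP alignment.
compare_unaligned PROC
        push    rbp
        mov     rbp, rsp            ; remember RSP before any realignment
        and     rsp, -16            ; force 16-byte alignment (moves RSP down by 0 or 8)
        sub     rsp, 32             ; shadow space; RSP stays 16-byte aligned
        call    memcmp
        mov     rsp, rbp            ; undo both adjustments, whatever their total was
        pop     rbp
        ret
compare_unaligned ENDP
END
```

Compared with the PUSH RSP / PUSH [RSP] / POP RSP sequence in the question, this keeps a single copy of the old stack pointer in a register rather than two copies in memory.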

Edited the question to show where the original assembly code was modified from. GoAsm also uses **PUSH [RSP] ;keep another copy of that on the stack** so I assumed there was a good reason to keep it. (not sure why myself) — vengy, Aug 15 '21 at 00:26
@vengy: I assumed that was only so `pop rsp` would pop the original RSP, regardless of whether `and rsp, -16` changed RSP by 8 or 0, i.e. it places copies of RSP in both places it might reload from. Like I said, seems inefficient. (Especially since `push [rsp]` reloads from memory; `mov rax, rsp` / `push rax` / `push rax` would have been less bad. Even `push rsp` is 2 uops on AMD Zen CPUs, and 3 on Intel, according to Agner Fog's testing (https://agner.org/optimize/).) — Peter Cordes, Aug 15 '21 at 00:29
Good point. Also, I recall **LEA RSP,[RSP+40]** was used because I had run into an issue that after the **CALL TheAPI**, there was a **SUB RSP,40** but that affected some flags and caused issues downstream. In your answer, could you show the final suggested ASM code? — vengy, Aug 15 '21 at 00:34
@vengy: This is supposed to be part of a larger function, but I don't know whether it needs an odd or even amount of stack space, or which / how many call-preserved registers it saves/restores. The actual sequence for calling `memcmp` shouldn't involve the stack at all except for `call` itself (and can be what's left from your code after removing the stack manipulation), with all push/pop and RSP manipulation happening at the start/end of the whole function that contains this, to reserve enough stack space for locals plus shadow space for callees. — Peter Cordes, Aug 15 '21 at 00:43
@vengy: `SUB RSP,40` after the call would be obviously wrong; use `add rsp, 40` instead like your LEA is doing. Unless TheAPI returns a value in FLAGS you need to preserve, you can use `add` even though it overwrites FLAGS. Function-calls must be assumed to overwrite FLAGS, so you can't `cmp eax, ecx` / `call foo` / `je`. But like I said, don't do anything to the stack until the very end of the function that contains the `call`, not right after the `call`. — Peter Cordes, Aug 15 '21 at 00:45
I guess I'm looking for reassurance that the optimizations I stated between the old and new code w.r.t. AND and LEA are valid. But, based on your answer, using the new AND is good, but replacing the SUB with LEA is pointless. — vengy, Aug 15 '21 at 00:49
@vengy: Using the new `AND` is valid, but unless you can't trust your own caller to follow the ABI, removing `AND` altogether would be even better. Look at optimizing-compiler output for an example, if you write a function that takes args, calls `memcpy` a couple different ways, and returns some combination of the return values. (So it doesn't get optimized away or into a tailcall.) [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116). (Mostly applies to MSVC as well if you want to look at MSVC output on https://godbolt.org/) — Peter Cordes, Aug 15 '21 at 00:52
Gradually I'm switching to C, so I won't have to deal with the ABIs. Thanks for your time and comments! — vengy, Aug 15 '21 at 00:57
