
Why does al contain the number of vector parameters in assembly?

Why are vector parameters any different from normal parameters for the callee?

Riolku
  • Note, that's only for varargs functions and it's not `eax` just `al`. I guess it's to allow generic thunks to process the appropriate number of vector registers e.g. to save space. The spec says that it does _"not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used"_. – Jester Jul 24 '18 at 00:04
  • Note that it's not the number of vector args, it's the number that are passed in XMM/YMM/ZMM regs. You can have an unlimited number of `__m128` args, but beyond the first 8 they're passed on the stack. I *think* it's supposed to work to pass FP args on the stack and set `al=0`, but I haven't tested. gcc's code-gen for variadic functions just checks `al!=0` and stores all of xmm0..7 to an array on the stack which it can index, so maybe the first 8 FP / vector args do need to go in vector regs. – Peter Cordes Jul 24 '18 at 00:14
  • If I remember correctly, arguments are just passed in registers, with extra arguments passed on the stack. If this is the case, why do I need to pass it the number of FP/vector args? Shouldn't the callee be able to access what it knows it needs to access, or is this more important for variadic args? – Riolku Jul 24 '18 at 00:25
  • Sometimes that information is not available otherwise, e.g. in generic thunks or hooks. – Jester Jul 24 '18 at 00:32
  • how does the compiler know how many of the varargs parameters are going to be passed in XMM registers? – Riolku Jul 24 '18 at 00:51
  • It's the compiler that's loading the registers (in the caller) so of course it knows. It follows the rules in the ABI. – Jester Jul 24 '18 at 01:03
  • This guy says that [*This will make printf debugging hard without*](http://nickdesaulniers.github.io/blog/2014/04/18/lets-write-some-x86-64/) but I'm not sure why – phuclv Jul 24 '18 at 01:06
  • @Jester ya whoops :p, but why do these registers need to be accounted for, but not other variadic registers? Like if I pass a ton of ints, they don't have to be accounted for... – Riolku Jul 24 '18 at 01:09
  • A copy of AMD System V ABI documentation can be found here: https://www.uclibc.org/docs/psABI-x86_64.pdf I guess the reason is to add flexibility. When AL is 8 argument passing is essentially same as regular function. When AL is 0 floats are always passed on stack. – W. Chang Jul 24 '18 at 01:16
  • Oh, so that's how that works. That's really neat, and helpful, too. Is there any way to do that with other parameters? Say, a bunch of ints? – Riolku Jul 24 '18 at 01:54
  • The first 6 integer class values are passed in registers. RDI, RSI, RDX, RCX, R8, R9 in that order. The remainder are pushed on the stack (right to left) – Michael Petch Jul 24 '18 at 02:24
  • @W.Chang: I just tested; no, you can't set AL=0 and have printf read all the floats from the stack. I set up `[rsp] = 1.0`, `xmm0 = 2.0`, and called printf with `al=0`. It printed `0.0000`, presumably because it still indexed its local stack memory where it would have dumped xmm0..7 on function entry, and that memory was still all zero because AL=0 so XMM registers weren't dumped. I had thought the ABI wording implied you could do this, but *that* would have been overcomplicated vs. a fixed XMM / stack cutoff for code-gen for variadic functions. – Peter Cordes Jul 24 '18 at 04:11
  • @PeterCordes Thank you for testing it. My guess was wrong. phuclv gave the answer below. – W. Chang Jul 24 '18 at 05:58
  • @Riolku It looks neat at first, but it is actually quite a bad idea, because if compilers implement an ABI differently, it destroys interoperability. To maintain interoperability, the callee would have to deal with multiple scenarios. As Peter said, it overcomplicates things. – W. Chang Jul 24 '18 at 06:06

1 Answer


The value is used for optimization, as stated in the ABI document:

"The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit."

(3.5.7 Variable Argument Lists, The Register Save Area. System V Application Binary Interface version 1.0)

A function that uses va_start saves all the parameters passed in registers to the register save area in its prologue:

"To start, any function that is known to use va_start is required to, at the start of the function, save all registers that may have been used to pass arguments onto the stack, into the “register save area”, for future access by va_start and va_arg. This is an obvious step, and I believe pretty standard on any platform with a register calling convention. The registers are saved as integer registers followed by floating point registers..."

(Source: https://blog.nelhage.com/2010/10/amd64-and-va_arg/)
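As a rough illustration of that layout, here is a minimal C sketch of where each register lands in the register save area. The sizes follow the x86-64 System V ABI layout (six 8-byte general purpose slots followed by eight 16-byte vector slots); the helper names `gp_slot_offset` and `vec_slot_offset` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes of the x86-64 System V register save area. */
enum {
    GP_REGS = 6,    /* rdi, rsi, rdx, rcx, r8, r9 */
    GP_SLOT = 8,    /* bytes per general purpose register */
    VEC_REGS = 8,   /* xmm0..xmm7 */
    VEC_SLOT = 16,  /* bytes per vector register */
    SAVE_AREA_SIZE = GP_REGS * GP_SLOT + VEC_REGS * VEC_SLOT /* 176 */
};

/* Hypothetical helper: byte offset of the i-th integer argument register. */
size_t gp_slot_offset(unsigned i) {
    assert(i < GP_REGS);
    return (size_t)i * GP_SLOT;
}

/* Hypothetical helper: byte offset of xmm<i>, placed after all GP slots. */
size_t vec_slot_offset(unsigned i) {
    assert(i < VEC_REGS);
    return (size_t)GP_REGS * GP_SLOT + (size_t)i * VEC_SLOT;
}
```

These offsets agree with the ICC prologue shown later in this answer, where rsi is stored at `[8+rsp]` and xmm0 at `[175-127+rsp]`, i.e. `[48+rsp]`.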

But saving all 8 vector registers could be slow, so the compiler may choose to optimize it using the value passed in al:

"... As an optimization, during a function call, %rax is required to hold the number of SSE registers used to hold arguments, to allow a varargs caller to avoid touching the FPU at all if there are no floating point arguments."

(Source: https://blog.nelhage.com/2010/10/amd64-and-va_arg/)

Since you want to save at least the registers used, the value can be larger than the real number of used registers. That's why there's this line in the ABI:

"The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive."
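To make the caller's side concrete, here is a small C sketch with two variadic functions (`sum_doubles` and `sum_ints` are hypothetical names used only for illustration). The C source never mentions al; on x86-64 System V it is the compiler that fills it in at each call site. For `sum_doubles(2, 1.5, 2.5)` the two doubles go in xmm0 and xmm1, so al must be at least 2 (for example via `mov eax, 2`), while for `sum_ints(3, 1, 2, 3)` no vector registers are used and al can be 0.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical variadic function: sums `count` double arguments.
 * On x86-64 System V the caller passes the first 8 doubles in
 * xmm0..xmm7 and sets al to an upper bound on how many it used. */
double sum_doubles(int count, ...) {
    assert(count >= 0);
    va_list ap;
    va_start(ap, count);
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

/* Hypothetical variadic function: sums `count` int arguments.
 * No vector registers carry arguments here, so a caller may set al to 0. */
long sum_ints(int count, ...) {
    assert(count >= 0);
    va_list ap;
    va_start(ap, count);
    long total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```

Compiling a call to either function and inspecting the generated assembly is an easy way to see how a particular compiler sets al before the `call` instruction.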

You can see the effect in the prologue generated by ICC:

    sub       rsp, 216                                      #5.1
    mov       QWORD PTR [8+rsp], rsi                        #5.1
    mov       QWORD PTR [16+rsp], rdx                       #5.1
    mov       QWORD PTR [24+rsp], rcx                       #5.1
    mov       QWORD PTR [32+rsp], r8                        #5.1
    mov       QWORD PTR [40+rsp], r9                        #5.1
    movzx     r11d, al                                      #5.1
    lea       rax, QWORD PTR [r11*4]                        #5.1
    lea       r11, QWORD PTR ..___tag_value_varstrings(int, ...).6[rip] #5.1
    sub       r11, rax                                      #5.1
    lea       rax, QWORD PTR [175+rsp]                      #5.1
    jmp       r11                                           #5.1
    movaps    XMMWORD PTR [-15+rax], xmm7                   #5.1
    movaps    XMMWORD PTR [-31+rax], xmm6                   #5.1
    movaps    XMMWORD PTR [-47+rax], xmm5                   #5.1
    movaps    XMMWORD PTR [-63+rax], xmm4                   #5.1
    movaps    XMMWORD PTR [-79+rax], xmm3                   #5.1
    movaps    XMMWORD PTR [-95+rax], xmm2                   #5.1
    movaps    XMMWORD PTR [-111+rax], xmm1                  #5.1
    movaps    XMMWORD PTR [-127+rax], xmm0                  #5.1
..___tag_value_varstrings(int, ...).6: 

It's essentially a Duff's device. The r11 register is loaded with the address just after the XMM-saving instructions, and then al*4 is subtracted from it (since each `movaps XMMWORD PTR [rax-X], xmmX` is 4 bytes long) to get the address of the first `movaps` instruction that should run. Execution then falls through the remaining stores down to xmm0.
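The same idea can be sketched in C with a `switch` whose cases fall through, which is how Duff's device is usually written. This is only a model of the computed jump, not ICC's actual implementation: `vec128` is a hypothetical stand-in for an XMM register and `save_vector_regs` is a made-up name.

```c
#include <assert.h>

/* Hypothetical stand-in for one 16-byte XMM register. */
typedef struct { double lo, hi; } vec128;

/*
 * Models the computed jump: entering the chain at `case al` skips the
 * stores for registers al..7 and falls through the stores for
 * regs[al-1] down to regs[0]. Returns the number of slots written.
 */
int save_vector_regs(unsigned al, const vec128 regs[8], vec128 save_area[8]) {
    assert(al <= 8);
    int stored = 0;
    switch (al) {
    case 8: save_area[7] = regs[7]; stored++; /* fall through */
    case 7: save_area[6] = regs[6]; stored++; /* fall through */
    case 6: save_area[5] = regs[5]; stored++; /* fall through */
    case 5: save_area[4] = regs[4]; stored++; /* fall through */
    case 4: save_area[3] = regs[3]; stored++; /* fall through */
    case 3: save_area[2] = regs[2]; stored++; /* fall through */
    case 2: save_area[1] = regs[1]; stored++; /* fall through */
    case 1: save_area[0] = regs[0]; stored++; /* fall through */
    case 0: break;
    }
    return stored;
}
```

With `al = 3` only the slots for xmm2, xmm1 and xmm0 are written, which is why an upper bound is enough: a larger value just stores a few registers that are never read.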

As far as I can see, other compilers either save all the vector registers or don't save any of them, so they don't care about the exact value in al and just check whether it's zero.
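A minimal C model of that all-or-nothing strategy, assuming a callee that only tests al against zero (the function name is hypothetical, and only the low double of each register is modelled):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of an all-or-nothing prologue: the exact count in
 * al is ignored, only whether it is zero matters.
 * Returns true if the vector registers were dumped to the save area.
 */
bool save_all_or_nothing(unsigned al, const double xmm[8], double save_area[8]) {
    assert(al <= 8);
    if (al == 0)                 /* models `test al, al` / `jz` */
        return false;            /* save area is left untouched */
    for (int i = 0; i < 8; i++)  /* models the 8 unconditional stores */
        save_area[i] = xmm[i];
    return true;
}
```

Under this model any non-zero upper bound behaves the same, while `al = 0` with a live value in xmm0 leaves the save area stale, which is consistent with the `printf` experiment described in the comments on the question.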

The general purpose registers are always saved, probably because it's cheaper to just move the 6 registers to memory than to spend time on a condition check, address calculation and jump. As a result, you don't need a parameter for how many integers were passed in registers.

Here is a similar question to yours. You can find more information in the links below.

phuclv
  • One more thing: why don't we save the integer registers? Is it because the caller is responsible for that, or because floating point registers are extra expensive to save? – Riolku Jul 24 '18 at 11:35
  • @Riolku the integer registers are saved, you can see that in the prolog. As I said in the last paragraph, the cost of 6 `mov` instructions to save them might be smaller than calculating which register to save, so just save them all – phuclv Jul 24 '18 at 11:39
  • Oh sorry. Great answer, that makes a lot of sense, thanks! – Riolku Jul 24 '18 at 11:53
  • You're quoting an old version of the ABI that erroneously said RAX needs to have the count. It's actually just AL, you must not assume that it's zero-extended to RAX (although many compilers do xor-zero RAX or use `mov eax,4` or whatever, when optimizing for size they will likely use 2-byte `mov al, 4` for non-zero numbers.) – Peter Cordes Nov 27 '20 at 02:27
  • Fun fact: old GCC used AL to save exactly the right number of XMM regs, with a computed jump IIRC, like you're showing for ICC. Current GCC just does `test al,al` / `jz` to maybe skip the 8x `movaps` instructions. – Peter Cordes Nov 27 '20 at 02:29