The usual approach to calling printf from AArch64 assembly, which works fine on Linux, does not seem to work on macOS running on the Apple M1.

Is there any documentation that explains what has changed?

I find that the parameters that I put in x0..x2 are getting garbled in the printf output.

Peter Cordes
  • 2
    Show your code. How can we answer without seeing your code? – 0___________ Oct 05 '21 at 16:42
  • `printf` does not take arguments in x1 or x2... – Siguza Oct 05 '21 at 17:07
  • 1
    I found https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms which does document some differences from the usual ARM64 conventions. Of course, you can also compile a `printf` call with your C compiler on MacOS and see how it looks. – Nate Eldredge Oct 05 '21 at 17:20
  • 2
    In particular, if I'm reading right, it expects all variadic arguments to be passed on the stack? I didn't know that. If that's so then @Siguza is right: the format string pointer would go in `x0` and everything else on the stack. – Nate Eldredge Oct 05 '21 at 17:22

1 Answer

The Darwin arm64 ABI passes all varargs arguments on the stack, each padded to the next multiple of 8 bytes. (Types that don't fit into 8 bytes have a pointer passed instead. Regular arguments that don't fit into x0-x7/q0-q7 come before varargs on the stack, naturally aligned.)

Here's a simple example:

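.globl _main
.align 2
_main:
    stp x29, x30, [sp, -0x10]!    // save frame pointer and link register
    sub sp, sp, 0x10              // reserve 16 bytes for the variadic arguments

    mov x8, 66
    str x8, [sp]                  // first variadic argument: 8-byte slot at [sp]
    adr x0, Lstr                  // the format string is the only register argument
    bl _printf

    mov w0, 0                     // return 0 from main
    add sp, sp, 0x10              // release the argument area
    ldp x29, x30, [sp], 0x10      // restore frame pointer and link register
    ret

Lstr:
    .asciz "test: %x\n"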
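
To make the slot arithmetic explicit, here is a minimal C sketch of the rule as I described it above: every variadic argument gets its own 8-byte slot, and the reserved area is rounded up so sp stays 16-byte aligned. The helper names (`vararg_offset`, `vararg_area_size`, `check_vararg_layout`) are made up for illustration and are not part of any ABI or library. It is only a model of the layout, not something the compiler or libc exposes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset from sp of the i-th variadic argument
 * (0-based), assuming each variadic argument occupies one 8-byte slot. */
size_t vararg_offset(size_t index)
{
    return index * 8;
}

/* Hypothetical helper: stack space to reserve for `count` variadic
 * arguments, rounded up to a multiple of 16 to keep sp aligned. */
size_t vararg_area_size(size_t count)
{
    return (count * 8 + 15) & ~(size_t)15;
}

/* Self-check against the assembly example above. */
void check_vararg_layout(void)
{
    assert(vararg_offset(0) == 0);        /* str x8, [sp] */
    assert(vararg_area_size(1) == 0x10);  /* sub sp, sp, 0x10 */
    assert(vararg_offset(1) == 8);        /* a second vararg would go at [sp, 8] */
}
```

Under this model, a call like `printf("%d %d\n", a, b)` with two `int` arguments would place `a` at `[sp]` and `b` at `[sp, 8]`, not at `[sp, 4]`.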

Note that the varargs layout differs from that of non-varargs arguments passed on the stack to unprototyped functions, which are only padded up to 4 bytes (sizeof(int)). The following code:

#include <stdio.h>
#include <stdint.h>

extern void func();
__asm__
(
    "_func:\n"
    "    ret\n"
);

int main(void)
{
    uint8_t a = 1,
            b = 2,
            c = 3;
    printf("%hhx %hhx %hhx %hhx %hhx %hhx\n", a, b, c, a, b, c);
    func(a, b, c, a, b, c, a, b, c, a, b, c);
    return 0;
}

compiles down to this with -O2:

;-- _main:
0x100003ee8      ff0301d1       sub sp, sp, 0x40
0x100003eec      fd7b03a9       stp x29, x30, [sp, 0x30]
0x100003ef0      fdc30091       add x29, sp, 0x30
0x100003ef4      68008052       mov w8, 3
0x100003ef8      49008052       mov w9, 2
0x100003efc      e92302a9       stp x9, x8, [sp, 0x20]
0x100003f00      2a008052       mov w10, 1
0x100003f04      e82b01a9       stp x8, x10, [sp, 0x10]
0x100003f08      ea2700a9       stp x10, x9, [sp]
0x100003f0c      20040010       adr x0, str._hhx__hhx__hhx__hhx__hhx__hhx_n
0x100003f10      1f2003d5       nop
0x100003f14      13000094       bl sym.imp.printf
0x100003f18      480080d2       mov x8, 2
0x100003f1c      6800c0f2       movk x8, 3, lsl 32
0x100003f20      690080d2       mov x9, 3
0x100003f24      2900c0f2       movk x9, 1, lsl 32
0x100003f28      e92300a9       stp x9, x8, [sp]
0x100003f2c      20008052       mov w0, 1
0x100003f30      41008052       mov w1, 2
0x100003f34      62008052       mov w2, 3
0x100003f38      23008052       mov w3, 1
0x100003f3c      44008052       mov w4, 2
0x100003f40      65008052       mov w5, 3
0x100003f44      26008052       mov w6, 1
0x100003f48      47008052       mov w7, 2
0x100003f4c      e6ffff97       bl sym._func
0x100003f50      00008052       mov w0, 0
0x100003f54      fd7b43a9       ldp x29, x30, [sp, 0x30]
0x100003f58      ff030191       add sp, sp, 0x40
0x100003f5c      c0035fd6       ret
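
As a cross-check on the listing above, this small C sketch rebuilds the two 64-bit constants that the compiler stores for the four stack arguments of `func` (the 9th to 12th arguments: c, a, b, c = 3, 1, 2, 3). It assumes each argument is promoted to a 4-byte `int` and that two neighbours are written with one 8-byte little-endian store. `pack_pair` and `check_unprototyped_layout` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 64-bit value whose little-endian store puts
 * `low` at the lower address and `high` four bytes above it. */
uint64_t pack_pair(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Self-check against the constants in the disassembly. */
void check_unprototyped_layout(void)
{
    /* x9 = mov 3; movk 1, lsl 32  -> [sp] = 3, [sp+4] = 1 */
    assert(pack_pair(3, 1) == 0x0000000100000003ULL);
    /* x8 = mov 2; movk 3, lsl 32  -> [sp+8] = 2, [sp+12] = 3 */
    assert(pack_pair(2, 3) == 0x0000000300000002ULL);
}
```

So the four stack arguments occupy 16 bytes in total, against the 48 bytes the six `printf` varargs take in the same listing.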

Giving the function an actual prototype removes all padding except what alignment requires, like so (note that the last argument is 8 bytes wide):

extern void func(uint8_t, uint8_t, uint8_t, uint8_t, uint8_t, uint8_t,
                 uint8_t, uint8_t, uint8_t, uint8_t, uint8_t, uint64_t);

The code then compiles down to:

;-- _main:
0x100003ee4      ff4301d1       sub sp, sp, 0x50
0x100003ee8      f44f03a9       stp x20, x19, [sp, 0x30]
0x100003eec      fd7b04a9       stp x29, x30, [sp, 0x40]
0x100003ef0      fd030191       add x29, sp, 0x40
0x100003ef4      73008052       mov w19, 3
0x100003ef8      54008052       mov w20, 2
0x100003efc      f44f02a9       stp x20, x19, [sp, 0x20]
0x100003f00      28008052       mov w8, 1
0x100003f04      f32301a9       stp x19, x8, [sp, 0x10]
0x100003f08      e85300a9       stp x8, x20, [sp]
0x100003f0c      20040010       adr x0, str._hhx__hhx__hhx__hhx__hhx__hhx_n
0x100003f10      1f2003d5       nop
0x100003f14      13000094       bl sym.imp.printf
0x100003f18      68208052       mov w8, 0x103
0x100003f1c      f30700f9       str x19, [sp, 8]
0x100003f20      f40b0039       strb w20, [sp, 2]
0x100003f24      e8030079       strh w8, [sp]
0x100003f28      20008052       mov w0, 1
0x100003f2c      41008052       mov w1, 2
0x100003f30      62008052       mov w2, 3
0x100003f34      23008052       mov w3, 1
0x100003f38      44008052       mov w4, 2
0x100003f3c      65008052       mov w5, 3
0x100003f40      26008052       mov w6, 1
0x100003f44      47008052       mov w7, 2
0x100003f48      e6ffff97       bl sym._func
0x100003f4c      00008052       mov w0, 0
0x100003f50      fd7b44a9       ldp x29, x30, [sp, 0x40]
0x100003f54      f44f43a9       ldp x20, x19, [sp, 0x30]
0x100003f58      ff430191       add sp, sp, 0x50
0x100003f5c      c0035fd6       ret
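
The packing in this second listing can be modelled with a short C sketch, assuming each stack argument is placed at the next offset that is a multiple of its own size (sizes taken to be non-zero powers of two) and the whole area is rounded up to 16 bytes. The helpers `align_up`, `layout_stack_args` and `check_prototyped_layout` are hypothetical and exist only to illustrate the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `offset` up to a multiple of `align`,
 * where `align` is assumed to be a non-zero power of two. */
size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical helper: lay out arguments of the given sizes back to back,
 * each at its natural alignment, writing the offsets to out[].
 * Returns the size of the area rounded up to 16 bytes. */
size_t layout_stack_args(const size_t *sizes, size_t count, size_t *out)
{
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        offset = align_up(offset, sizes[i]);
        out[i] = offset;
        offset += sizes[i];
    }
    return align_up(offset, 16);
}

/* Self-check: three uint8_t followed by one uint64_t, as in the prototype. */
void check_prototyped_layout(void)
{
    size_t sizes[] = {1, 1, 1, 8};
    size_t out[4];
    size_t total = layout_stack_args(sizes, 4, out);
    assert(out[0] == 0 && out[1] == 1);  /* strh w8, [sp] writes bytes 0 and 1 */
    assert(out[2] == 2);                 /* strb w20, [sp, 2] */
    assert(out[3] == 8);                 /* str x19, [sp, 8] */
    assert(total == 16);
}
```

The offsets 0, 1, 2 and 8 line up with the `strh`, `strb` and `str` stores before the call to `func`.
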
Siguza
  • I was a little bit confused by the remark in [Apple's ABI notes](https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms): "The C language requires the promotion of arguments smaller than int before a call. Beyond that, the Apple platforms ABI doesn’t add unused bytes to the stack." Above that, it seems that for non-variadic functions, arguments of less than 8 bytes can be packed, e.g. two `int` arguments would share an 8-byte stack slot instead of getting their own. I'm not sure if this remark means that the same applies to variadic functions or not. – Nate Eldredge Oct 05 '21 at 18:53
  • So if we do `printf("%d %d\n", a, b)`, with `a,b` being `int`, does the second one go at `[sp+4]` or `[sp+8]`? – Nate Eldredge Oct 05 '21 at 18:55
  • (I wish godbolt would support an ARM64 MacOS target so I could test for myself...) – Nate Eldredge Oct 05 '21 at 18:58
  • @NateEldredge: packing more than 1 arg into an arg passing "slot" of register width would be really surprising and unusual. Agreed that the wording you quote does seem to imply that, though. – Peter Cordes Oct 05 '21 at 19:19
  • 1
    @PeterCordes: The text definitely shows that happening for non-variadic functions. "The following example illustrates how Apple platforms specify stack-based arguments that are not multiples of 8 bytes. On entry to the function, s0 occupies one byte at the current stack pointer (sp), and s1 occupies one byte at sp+1. The compiler still adds padding after s1 to satisfy the stack’s 16-byte alignment requirements." This seems to be one of the major variances from AAPCS. – Nate Eldredge Oct 05 '21 at 19:20
  • @NateEldredge you were right about me overwriting x29, thanks. But yes, varargs do indeed have different padding than non-varargs arguments that are passed on the stack. – Siguza Oct 05 '21 at 19:40
  • In your updated example, you're showing default promotions mandated by ISO C for an unprototyped function. Can stack args be even narrower for functions with prototypes, or does the calling convention still widen them to 32-bit even though in C abstract-machine terms they're still `uint8_t` args? In Nate's link, I found "*The caller of a function is responsible for signing or zero-extending any argument with fewer than 32 bits. The standard ABI expects the callee to sign or zero-extend those arguments.*", so probably a 32-bit minimum, for a different reason than ISO C default argument promotions. – Peter Cordes Oct 05 '21 at 22:48
  • Semi-related: clang on x86-64 SysV [relies on narrow args having been correctly extended by the caller](https://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-required-when-adding-a-32bit-offset-to-a-pointer-for/36760539#36760539), even though that ABI doc does *not* require that. GCC implements that for callers but does *not* rely on it in callees. ICC doesn't do it for callers, so isn't fully compatible with clang/LLVM on x86-64 Linux (or presumably MacOS). So anyway, this x86-64 calling convention extension idea may have come from Apple LLVM development. – Peter Cordes Oct 05 '21 at 22:55
  • 1
    @PeterCordes updated my answer - stack args can be narrower if given a prototype. – Siguza Oct 05 '21 at 22:55
  • Thanks. I'd suggest linking the ABI doc in your answer. And if you have time, it would be great if you could cite which portions of it specify which behaviour. I only very quickly skimmed / searched it, so maybe that'd be obvious if I actually read it, but is extending to 32-bit a thing at all when there are prototypes? Like does an `int8_t(0xff)` get *sign* extended to `0xffffffff` if it gets a whole register to itself instead of a narrow stack slot? – Peter Cordes Oct 05 '21 at 23:00
  • 1
    @PeterCordes yes: if you write a function that takes an `int8_t` and compare that `< 0`, clang simply emits `cmp w0, 0`, so it assumes the value to have been sign-extended. If the same argument is passed on the stack, it emits `ldrsb`. But this isn't really clear from the ABI doc... I was hoping that the clang repo would have a more explicit spec, but if it does, I haven't found it. :/ – Siguza Oct 06 '21 at 18:25
  • Thanks. Maybe it's just like x86-64 then, where extension to 32 is an unofficial undocumented extension to the calling convention that LLVM depends on. – Peter Cordes Oct 06 '21 at 18:30