Difference in ABI between x86_64 Linux functions and syscalls

Question

The x86_64 SysV ABI's function calling convention defines integer argument #4 to be passed in the rcx register. The Linux kernel syscall ABI, on the other hand, uses r10 for that same purpose. All other arguments are passed in the same registers for both functions and syscalls.

This leads to some strange things. Check out, for example, the implementation of mmap in glibc for the x32 platform (for which the same discrepancy exists):

00432ce0 <__mmap>:
  432ce0:       49 89 ca                mov    %rcx,%r10
  432ce3:       b8 09 00 00 40          mov    $0x40000009,%eax
  432ce8:       0f 05                   syscall

So all registers are already in place; the only work left is moving rcx to r10.
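For anyone who wants to try this outside glibc, here is a minimal sketch of the same kind of stub for plain x86-64 Linux, written as top-level assembly in a C file (GCC or Clang assumed). The names my_raw_mmap and map_anonymous are made up for illustration. The syscall number is 9 here rather than the x32 value 0x40000009 from the listing above, and the stub returns the kernel's raw result instead of setting errno as a real libc wrapper would.

```c
#define _GNU_SOURCE
#include <sys/mman.h>

/* Hypothetical name. All parameters are raw 64-bit values so that every
   argument register is fully defined. Returns the kernel's raw result
   (a negative errno value on failure); errno is not touched. */
long my_raw_mmap(unsigned long addr, unsigned long len, unsigned long prot,
                 unsigned long flags, unsigned long fd, unsigned long off);

__asm__(
    ".pushsection .text\n"
    ".globl my_raw_mmap\n"
    ".type my_raw_mmap, @function\n"
    "my_raw_mmap:\n"
    "    mov %rcx, %r10\n"   /* 4th arg: function ABI rcx -> syscall ABI r10 */
    "    mov $9, %eax\n"     /* 9 is mmap's syscall number on x86-64 */
    "    syscall\n"
    "    ret\n"
    ".size my_raw_mmap, .-my_raw_mmap\n"
    ".popsection\n"
);

/* Illustrative helper: maps len bytes of anonymous, private, read-write memory. */
static long map_anonymous(unsigned long len)
{
    return my_raw_mmap(0, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, (unsigned long)-1, 0);
}
```

Calling map_anonymous(4096) should return a usable address as a positive long, or a small negative number if the kernel rejected the request.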

I am wondering why the syscall ABI was not defined to be the same as the function call ABI, considering that they are already so similar.

In [another ABI answer](http://stackoverflow.com/a/35619528/224132), I dug up some links to amd64 mailing-list posts from AMD architects and Linux kernel developers before the first AMD64 silicon was released. There's some interesting stuff there, like the experimental results (from compiling SPECint and looking at code-size and number of instructions) that led to the x86-64 SysV ABI's choices for which register to use for what. — Peter Cordes, Jul 26 '16 at 09:42

score 10 · Accepted Answer · edited Jul 26 '16 at 08:15

The syscall instruction is intended to provide a quicker method of entering Ring-0 in order to carry out a system call. This is meant to be an improvement over the old method, which was to raise a software interrupt (int 0x80 on Linux).

Part of the reason the instruction is faster is that it does not write to memory, or even change rsp to point at a kernel stack. A software interrupt forces the CPU to save enough state for the OS to resume the interrupted code without clobbering anything. With syscall, the CPU is allowed to assume that user-space knows a system call is being made and can tolerate some registers being overwritten.

In particular, syscall stores two parts of the user-space state in registers. The RIP to return to after the call is stored in rcx, and the flags are stored in r11 (because RFLAGS is masked with a kernel-supplied value before entry to the kernel). This means that both of those registers are clobbered by the instruction: whatever user-space had in them is overwritten.
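One way to see the clobbering is to load known values into rcx and r11, execute syscall, and look at what comes back. The sketch below does that with GCC-style inline assembly around getpid. The struct and function names and the sentinel value are illustrative assumptions. What the kernel leaves in the two registers on return is an implementation detail, so treat the comparison as something to check on your own machine.

```c
#include <stdint.h>
#include <sys/syscall.h>

/* Arbitrary marker value (an assumption for this sketch). */
#define SENTINEL 0x1122334455667788ULL

struct syscall_observation {
    long     ret;        /* value returned in rax (the pid here) */
    uint64_t rcx_after;  /* rcx as seen right after syscall */
    uint64_t r11_after;  /* r11 as seen right after syscall */
    uint64_t next_rip;   /* address of the instruction following syscall */
};

static struct syscall_observation observe_syscall(void)
{
    /* Pin the two variables to the registers that syscall overwrites. */
    register uint64_t rcx_val __asm__("rcx") = SENTINEL;
    register uint64_t r11_val __asm__("r11") = SENTINEL;
    long rax_val = SYS_getpid;
    uint64_t next_rip;

    __asm__ volatile(
        "syscall\n"
        "1:\n\t"
        "lea 1b(%%rip), %[next]"   /* address of label 1, i.e. just past syscall */
        : "+a"(rax_val), "+r"(rcx_val), "+r"(r11_val), [next] "=&r"(next_rip)
        :
        : "cc", "memory");

    struct syscall_observation o = { rax_val, rcx_val, r11_val, next_rip };
    return o;
}
```

Printing the four fields with printf is enough to compare rcx_after against next_rip and to see whether r11_after still holds the sentinel. User code should only rely on both registers being overwritten, not on the specific values.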

Since they are clobbered, the syscall ABI uses another register instead of rcx, hence the use of r10 for the 4th argument.

r10 is a natural choice, since in the x86-64 SystemV ABI it's not used for passing function args, and functions don't need to preserve their caller's value of r10. So a syscall wrapper function can mov %rcx, %r10 without any save/restore. No other register would allow this for 6-arg syscalls under the SysV function calling convention: the remaining call-clobbered registers are either argument registers, rax (which holds the syscall number), or r11 (which syscall itself overwrites).
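In C, the same convention can be sketched as an inline-assembly helper, assuming GCC or Clang on x86-64 Linux. The names raw_syscall6 and read_signal_mask are hypothetical. The constraints follow the register assignment described above, with rcx and r11 listed as clobbers so the compiler does not keep live values in them across the instruction.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>

/* Hypothetical helper: issues a system call with up to six arguments and
   returns the kernel's raw result (negative errno value on failure). */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    register long r10 __asm__("r10") = a4;  /* 4th arg: r10, not rcx */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile(
        "syscall"
        : "=a"(ret)
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory");  /* syscall overwrites rcx and r11 */
    return ret;
}

/* Reads the calling thread's blocked-signal mask. With a NULL new set the
   kernel only reports the old mask; the 4th argument is the set size. */
static long read_signal_mask(unsigned long *mask_out)
{
    return raw_syscall6(SYS_rt_sigprocmask, SIG_BLOCK, 0,
                        (long)mask_out, sizeof *mask_out, 0, 0);
}
```

rt_sigprocmask is used here only because the kernel checks its 4th argument (the size of the signal set), so a wrong value in r10 shows up as an error return rather than passing silently.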

BTW, the 32-bit system call ABI is also accessible with sysenter, which requires cooperation between user-space and kernel-space to allow returning to user-space after a sysenter. (i.e. storing some state in user-space before running sysenter). This is higher performance than int 0x80, but awkward. Still, glibc uses it (by jumping to user-space code in the vdso pages that the kernel maps into the address space of every process).

AMD's syscall is another approach to the same idea as Intel's sysenter: to make entry/exit from the kernel less expensive by not preserving absolutely everything.

It's more subtle than just replacing a couple stores with register-register moves. It doesn't change `rsp` to point to the kernel stack, so there would be no sane place to push anything it wanted to save. Kernel code at the entry point has to do that itself. (using `swapgs` enables `[gs:absolute_address]` to access per-task kernel data). The CPU doesn't also internally keep a kernel stack pointer to use for `syscall`, just a saved `gs` value. I think this is where the implementation-complexity reduction comes from. (And that `swapgs` is a separate instruction). — Peter Cordes, Jul 26 '16 at 05:21
The part about C/C++ not using `r10` is meaningless. The kernel is not allowed to assume which language is performing the call. — Shachar Shemesh, Jul 26 '16 at 06:50
I found a way to word it which doesn't mention static chain pointers at all, while still being completely accurate and not ignoring them (I think :). They're not relevant for wrapper functions, and are a distraction from the point here (esp. since most people have never heard of them, and I don't even know exactly what they are). I also cleaned up the comment thread. — Peter Cordes, Jul 26 '16 at 08:19

score 6 · Answer 2 · answered Jul 25 '16 at 21:42

AMD's syscall clobbers the rcx register, thus r10 is used instead.

a3f

And `r10` is a pure scratch register: not used for function arg-passing, and not call-preserved. This lets other wrapper functions like dynamic-linker stubs use it as a temporary and still be able to tail-call with a `jmp` instead of a `call` / `pop` / `ret`. So `r10` is a good choice for syscalls. `syscall` / `sysret` also use r11. – Peter Cordes Jul 26 '16 at 05:15
Oops, I was thinking of `r11`. The ABI says `r10` is used for passing a "static chain pointer". C/C++ don't use that, so in practice `r10` is also a pure scratch register. – Peter Cordes Jul 26 '16 at 05:33
What is the meaning of "clobber" in this context? – tuket Mar 10 '21 at 21:28
