When does Linux x86-64 syscall clobber %r8, %r9 and %r10?

Question

I have just browsed the Linux kernel source tree and read the file tools/include/nolibc/nolibc.h.

I saw that the syscall macros in this file list %r8, %r9 and %r10 in their clobber lists.
Also there is a comment that says:

rcx and r8..r11 may be clobbered, others are preserved.

As far as I know, syscall only clobbers %rax, %rcx and %r11 (and memory).

Is there a real example of syscall that clobbers %r8, %r9 and %r10?

Now, I am a bit worried that I didn't put `%r8`, `%r9` and `%r10` in my syscall clobber list (inline Assembly) and it is currently running on production app. But I haven't seen any issue so far. — Ammar Faizi, Oct 10 '21 at 14:22
I am still pondering the syscall entry code, hopefully it is not true that `%r8...%r10` are clobbered. https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/entry_64.S#L50 — Ammar Faizi, Oct 10 '21 at 14:58
Notice the `PUSH_AND_CLEAR_REGS` on line 107 and the `POP_REGS` on line 193. These save and restore r8-r10 so unless a system call explicitly messes with the copy on the stack, they will be preserved. One example is of course `exec` but that does not return so nothing to worry about :) — Jester, Oct 10 '21 at 15:20
Thanks for the answer and comment, I have just sent out a patch to fix it, let's see whether it's acceptable or not. https://lore.kernel.org/lkml/20211011040344.437264-1-ammar.faizi@students.amikom.ac.id/T/ — Ammar Faizi, Oct 11 '21 at 04:20

Peter Cordes · Accepted Answer · 2022-01-05T10:28:47.610

Only 32-bit system calls (e.g. via int 0x80) made from 64-bit mode step on those registers, along with R11. (See "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?")

syscall properly saves/restores all regs including R8, R9, and R10, so user-space using it can assume they keep their values, except the RAX return value. (The kernel's syscall entry point even saves RCX and R11, but at that point they've already been overwritten by the syscall instruction itself with the original RIP and before-masking RFLAGS value.)

As for why someone might have thought otherwise: R8-R10, along with R11, are the non-legacy registers that are call-clobbered in the function-calling convention, so compiler-generated code for C functions inside the kernel naturally preserves R12-R15, even if an asm entry point didn't save them.

Currently the 64-bit int 0x80 entry point just pushes 0 for the call-clobbered R8-R11 registers in the process-state struct that it will restore from before returning to user space, instead of the original register values.

Historically, the int 0x80 entry point from 32-bit user-space didn't save/restore those registers at all. So their values were whatever compiler-generated kernel code left sitting around. This was thought to be innocent because 32-bit mode can't read those registers, until it was realized that user-space can far-jump to 64-bit mode, using the same CS value that the kernel uses for normal 64-bit user-space processes, selecting that system-wide GDT entry. So there was an actual info leak of kernel data, which was fixed by zeroing those registers.

IDK whether there used to be or still is a separate entry point from 64-bit user-space vs. 32-bit, or how they differ in struct pt_regs layout. The historical situation where int 0x80 leaked r8..r11 wouldn't have made sense for 64-bit user-space; that leak would have been obvious. So if they're unified now, they must not have been in the past.
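
The claim that `syscall` leaves R8-R10 alone can be checked from user space. Below is a minimal sketch, assuming x86-64 Linux and a GCC- or Clang-compatible compiler; the helper name `r8_r10_survive_syscall` is made up for illustration and is not part of nolibc. It loads sentinel values into the three registers, issues a `getpid` system call (number 39 on x86-64), and reads the registers back.

```c
#include <stdint.h>

/* Hypothetical helper (not from nolibc): returns 1 if r8, r9 and r10
 * still hold their sentinel values after a getpid system call, else 0.
 * Assumes x86-64 Linux and GNU-style extended asm. */
static int r8_r10_survive_syscall(void)
{
    /* Local register variables pin each value to a specific register
     * when it is used as an asm operand. */
    register uint64_t r8  __asm__("r8")  = 0x1111111111111111ULL;
    register uint64_t r9  __asm__("r9")  = 0x2222222222222222ULL;
    register uint64_t r10 __asm__("r10") = 0x3333333333333333ULL;
    long rax = 39; /* __NR_getpid on x86-64; it takes no arguments */

    /* "+r" makes each register both an input and an output, so the
     * compiler re-reads it after the instruction instead of assuming
     * the old value. rcx and r11 are overwritten by the syscall
     * instruction itself, so they are listed as clobbers. */
    __asm__ volatile("syscall"
                     : "+a"(rax), "+r"(r8), "+r"(r9), "+r"(r10)
                     :
                     : "rcx", "r11", "memory");

    return r8  == 0x1111111111111111ULL
        && r9  == 0x2222222222222222ULL
        && r10 == 0x3333333333333333ULL;
}
```

A single passing run is only an observation about one system call on one kernel, not a proof of the ABI; the guarantee comes from the entry code saving and restoring the full register set.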

@SepRoland: Yeah, I mean the actual answer is a very straightforward no, but the part that makes it at all interesting and worth answering is the calling-convention observation summarized from the `int 0x80` in 64-bit mode answer. But it's basically a parenthetical aside, a guess about why someone might have thought that. Maybe a `---` hrule would be better formatting. Yeah, I'll do that since I want to edit anyway to tweak something else. — Peter Cordes, Oct 11 '21 at 20:04

Ammar Faizi · Answer 2 · 2021-10-14T06:45:04.463


According to the x86-64 psABI, section A.2 "AMD64 Linux Kernel Conventions", subsection A.2.1 "Calling Conventions" [1]:

1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
3. The number of the syscall has to be passed in register %rax.
4. System-calls are limited to six arguments, no argument is passed directly on the stack.
5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
6. Only values of class INTEGER or class MEMORY are passed to the kernel.

From (2), (5) and (6), we can conclude that Linux x86-64 syscall clobbers %rax, %rcx and %r11 (and "memory").

Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1]
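
The conventions quoted above can be sketched as a six-argument wrapper. This is an illustration under stated assumptions (x86-64 Linux, GNU-style extended asm), not the nolibc macro; the names `raw_syscall6` and `decode_result` are hypothetical. Arguments four to six go in %r10, %r8 and %r9 as plain inputs, and the clobber list contains only %rcx, %r11 and "memory".

```c
#include <errno.h>

/* Hypothetical wrapper, not the nolibc macro: issues a system call with
 * up to six arguments and returns the raw value the kernel left in rax. */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    /* The kernel interface takes the fourth argument in r10 (not rcx),
     * then r8 and r9. */
    register long r10 __asm__("r10") = a4;
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;

    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3),
                       "r"(r10), "r"(r8), "r"(r9)
                     : "rcx", "r11", "memory");
    return ret;
}

/* Maps the in-band error range [-4095, -1] to the usual libc style:
 * return -1 and store the positive error number in errno. */
static long decode_result(long ret)
{
    if (ret >= -4095 && ret <= -1) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
```

For example, `raw_syscall6(3, -1, 0, 0, 0, 0, 0)` asks the kernel to close file descriptor -1 and gets `-EBADF` back in %rax; passing that value through `decode_result` yields -1 with `errno` set to `EBADF`.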

edited Oct 14 '21 at 06:45

answered Oct 13 '21 at 03:49

Ammar Faizi



Note that A.2 is marked "informative only". These aren't rules that Linux is obliged to follow. It's the other way around - the kernel developers define the calling conventions for system calls, and those conventions are simply included in the ABI appendix for the convenience of readers. Unfortunately, as far as I can tell, the kernel developers never saw fit to formally document these conventions beyond the kernel source itself, so the ABI authors presumably just described what the kernel actually does. – Nate Eldredge Oct 13 '21 at 04:00

Fortunately the ABI doc, and this summary of or quote from it, *is* accurate for Linux. (But not for all other OSes like Darwin / MacOS that do use the x86-64 SysV ABI for their user-space calling conventions, e.g. MacOS returns error status in CF instead of in-band via -errno.) Everything this answer says is already on SO in other places (e.g. [this Q&A](https://stackoverflow.com/questions/38751614/what-are-the-return-values-of-system-calls-in-assembly)), but it's not a bad thing to have it all in one place. – Peter Cordes Oct 13 '21 at 05:25

Although it's not quite accurate to say *the kernel* destroys RCX and R11. As I noted in my answer, the [`syscall`](https://www.felixcloutier.com/x86/syscall) instruction itself destroys them, before the kernel gets control. *Invoking* the kernel destroys them. [Why do x86-64 Linux system calls modify RCX, and what does the value mean?](https://stackoverflow.com/q/47983371) – Peter Cordes Oct 13 '21 at 05:28
@NateEldredge yeah it seems so. Now they are going to clarify the ABI document to make this contract more explicit, see the full story here: https://lore.kernel.org/lkml/alpine.LSU.2.20.2110131601000.26294@wotan.suse.de/ – Ammar Faizi Oct 14 '21 at 03:07
@PeterCordes _kernel destroys RCX and R11_ is not accurate, agreed. The ABI should say that it is the `syscall` instruction that clobbers them, not the kernel. – Ammar Faizi Oct 14 '21 at 03:10
Is that part of your answer a direct quote, not your paraphrase? If so, use quote formatting around the whole list. – Peter Cordes Oct 14 '21 at 03:36
@AmmarFaizi: Thanks for the LKML link about your patch. As documented in the `syscall(2)` man page, it's normal for the kernel to *not* clobber any registers. Zeroing some in the sysret return path would have been possible, since we'd know they hadn't been changed by ptrace, but at this point it would be ABI-breaking for anything that inlines `syscall` (without making non-standard assumptions like nolibc's macros). – Peter Cordes Oct 14 '21 at 04:03
The [sample `_start`](https://github.com/torvalds/linux/blob/348949d9a4440abdab3b1dc99a9bb660e8c7da7c/tools/include/nolibc/nolibc.h#L409) is broken, too, also for i386: RSP needs to be aligned *before* the `call`, not after call pushes a return address. So remove the `sub $8, %rsp`. Also, `mov %eax, %edi` to pass the full 32-bit return value from main; it is visible in at least one obscure way outside the process. And modern glibc uses syscall `231` `exit_group`, `_exit` is only ever used by pthread_exit. [Syscall implementation of exit()](https://stackoverflow.com/a/46903734) – Peter Cordes Oct 14 '21 at 04:05
Zeroing AVX registers as someone suggested on LKML would be interesting, but you'd want to do it on *entry* so xsaveopt or whatever could use a more compact format. But it's normally already done by user-space via `vzeroupper` before any non-inline function call. Unless we want to manually write the FP save-area with zeros in the same format that xsaveopt or whatever would have used, to (for the purposes of a context switch) actually zero *all* the registers, including zmm16..31 and the low half of ymm0..15. But then you need diff code depending on CPU features (xsave version, AVX512) – Peter Cordes Oct 14 '21 at 04:08
(I should probably just reply on the LKML, but then I'd have to create email headers with in-reply-to for that thread, since I'm not already subscribed.) – Peter Cordes Oct 14 '21 at 04:09
@PeterCordes yes, it's a direct quote, fixed the answer. – Ammar Faizi Oct 14 '21 at 06:47
@PeterCordes Do you want to submit a patch for that `_start`? If not, I can do it. – Ammar Faizi Oct 14 '21 at 06:47
@AmmarFaizi: If I don't get around to it in a couple days, go ahead. Or if you already understand the problems and can explain them clearly in the commit message, go ahead now. I'd be just as happy as long as it's efficient code and has accurate comments. BTW, the x86-64 SysV ABI guarantees that we enter user-space with RSP already 16-byte aligned, so probably just `mov (%rsp), %rdi` and LEA argv, then you don't have to realign RSP before `call main` if you want to depend on the kernel. – Peter Cordes Oct 14 '21 at 06:51
