Why do x86-64 Linux system calls modify RCX, and what does the value mean?

Question

I'm trying to allocate some memory on Linux with the sys_brk system call. Here is what I tried:

BYTES_TO_ALLOCATE equ 0x08

section .text
    global _start

_start:
    mov rax, 12
    mov rdi, BYTES_TO_ALLOCATE
    syscall

    mov rax, 60
    syscall

Per the Linux calling convention, I expected the return value (a pointer to the allocated memory) to be in the rax register. I ran this in gdb, and after the sys_brk syscall I saw the following register contents:

Before syscall

rax            0xc      12
rbx            0x0      0
rcx            0x0      0
rdx            0x0      0
rsi            0x0      0
rdi            0x8      8

After syscall

rax            0x401000 4198400
rbx            0x0      0
rcx            0x40008c 4194444 ; <---- What does this value mean?
rdx            0x0      0
rsi            0x0      0
rdi            0x8      8

I do not quite understand the value in the rcx register in this case. Which register should I use as the pointer to the start of the 8 bytes I allocated with sys_brk?

RCX and R11 are clobbered by the [SYSCALL](http://www.felixcloutier.com/x86/SYSCALL.html) instruction itself. From the instruction set reference: _(after saving the address of the instruction following SYSCALL into RCX)_. _RFLAGS_ gets stored into _R11_. — Michael Petch, Dec 26 '17 at 20:25
@MichaelPetch Very interesting. It means in order to use, say `cl` register afterwards I need to clear it first, right? I mean for example `xor cl, cl` and then `mov cl, 7`. — St.Antario, Dec 26 '17 at 20:29
You can't rely on the value of _RCX_ or _R11_ after the SYSCALL. So you'll have to either use one of the other registers instead of _RCX_ and _R11_ (and RAX) or you will have to save the value (stack for example) and restore it after. _RCX_ and _R11_ don't get set by you, you just can't use them and expect them to be the same before and after the SYSCALL. — Michael Petch, Dec 26 '17 at 20:31
Clearing it beforehand won't help: SYSCALL will just overwrite whatever was in it. You can set it after the SYSCALL if you wish, but if you do another SYSCALL the value will be clobbered again. — Michael Petch, Dec 26 '17 at 20:37

score 19 · Accepted Answer · edited Jun 20 '20 at 09:12

The system call return value is in rax, as always. See "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64".

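Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the "C library/kernel differences" section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. Your mov rdi, 8 therefore asks the kernel to move the break to address 8, which it refuses, and the 0x401000 in rax is just the current, unchanged break, not 8 newly allocated bytes. See "Assembly x86 brk() call use". That answer needs upvotes because it's the only good one on that question.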
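As a minimal sketch of that pointer-in, pointer-out interface (not part of the original answer), the question's program could be reworked roughly as follows. It assumes NASM syntax and a static executable built with something like `nasm -f elf64` followed by `ld`. The use of rbx to hold the old break and the `.fail` / `.exit` labels are illustrative choices, not anything the kernel requires.

```nasm
BYTES_TO_ALLOCATE equ 0x08

section .text
    global _start

_start:
    ; Query the current program break. Address 0 is never a valid new
    ; break, so the kernel changes nothing and returns the current break.
    mov eax, 12                         ; __NR_brk
    xor edi, edi                        ; requested break = 0
    syscall
    mov rbx, rax                        ; rbx = old break = start of our new block

    ; Ask the kernel to move the break BYTES_TO_ALLOCATE bytes higher.
    lea rdi, [rbx + BYTES_TO_ALLOCATE]  ; requested break = old break + 8
    mov eax, 12                         ; __NR_brk
    syscall
    cmp rax, rdi                        ; success: rax == requested break
    jne .fail                           ; (Linux preserves rdi across syscall)

    mov qword [rbx], 42                 ; the 8 bytes at [rbx] are now usable

    xor edi, edi                        ; exit status 0
    jmp .exit
.fail:
    mov edi, 1                          ; exit status 1: break was not moved
.exit:
    mov eax, 60                         ; __NR_exit
    syscall
```

The pointer to the new memory is the old break saved in rbx, not whatever the second call leaves in rax (that is the new end of the heap), and never rcx.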

The other interesting part of your question is:

"I do not quite understand the value in the rcx register in this case"

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores; it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)

Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

- sets RCX=RIP (the address of the instruction after syscall) and R11=RFLAGS, so it's impossible for the kernel to even see the original values of those registers from before you executed syscall.
- masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it has done swapgs and set rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF, so rep movs / stos go upward even if user-space had used std.

  Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
- jumps to the configured syscall entry point (setting RIP = IA32_LSTAR, with the new CS selector taken from the IA32_STAR MSR). The old CS value isn't saved anywhere, I think.

It doesn't do anything else. The kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
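To see this from user space, here is a small sketch (again NASM, static executable, not from the original answer). It saves a value that lives in rcx around a system call and checks that rcx comes back holding the address of the instruction after syscall. The `after_syscall` and `.restore` labels and the choice of getpid as a harmless call are assumptions made for illustration.

```nasm
section .text
    global _start

_start:
    mov ecx, 7                      ; a value we want to survive the call
    push rcx                        ; save it: syscall will overwrite rcx
    mov eax, 39                     ; __NR_getpid: no arguments, no side effects
    syscall
after_syscall:
    lea rdx, [rel after_syscall]    ; address of the instruction after syscall
    xor edi, edi                    ; exit status 0 if rcx holds that address
    cmp rcx, rdx
    je .restore
    mov edi, 1                      ; exit status 1 otherwise
.restore:
    pop rcx                         ; rcx = 7 again
    mov eax, 60                     ; __NR_exit
    syscall
```

Running it and then `echo $?` in the shell should print 0 in the normal case, where nothing such as ptrace has rewritten the saved rcx. In the question's dump, rcx = 0x40008c would be the address right after the first syscall if _start is at 0x400080 and the two mov instructions assemble to 5 bytes each. That is an assumption about the encoding, not something the question shows.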

What is the use case of the `sysret` instruction? The link you provided mentions that it is the companion instruction for `syscall`, but I have never seen `sysret` used after a `syscall` instruction! — Sourav Kannantha B, Jul 09 '21 at 09:13
@SouravKannanthaB: `syscall` calls into the kernel, `sysret` (in the kernel) returns to user-space. So the reason is the same as why you don't use `ret` after `call printf`, unless that happens to be the end of your function. See [What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?](https://stackoverflow.com/q/46087730) for some details about how the kernel int 0x80 and syscall entry points work. — Peter Cordes, Jul 09 '21 at 18:15
If the saved `RCX != RIP` or `R11 != RFLAGS`, the Linux kernel uses `iret` instead of `sysret`. Why not just restore `%rcx/%r11` from the saved `RIP/RFLAGS` and use `sysret`? (I think this would be faster.) — Fang Zhen, Jul 30 '21 at 02:14
@FangZhen: Because of CPU / ISA design bugs. e.g. if RIP is non-canonical, the CPU will #GP. But Intel CPUs will handle that exception without updating RSP (so it's the user stack), but the CPU is still in kernel mode. User-space could easily exploit that by having another thread modify that memory which is getting used as a kernel stack, after using ptrace to create a non-canonical RIP. So some checking is needed, and since `ptrace` or signals changing registers of other tasks are rare, it's fastest to just use a simple check. — Peter Cordes, Jul 30 '21 at 02:26
@FangZhen: See https://github.com/torvalds/linux/blob/e7d0c41ecc2e372a81741a30894f556afec24315/arch/x86/entry/entry_64.S#L260 for example. — Peter Cordes, Jul 30 '21 at 02:27
@PeterCordes As for the linked code, what happens if lines 255-258 are replaced with the following, which avoids the `RCX==RIP` check? ` movq RIP(%rsp), %rcx \n movq %rcx, %r11 ` — Fang Zhen, Jul 30 '21 at 05:53
@FangZhen: That would lead to wrong behaviour for cases where ptrace wanted to set RCX in another task. (Or maybe a signal handler modifying registers). e.g. in GDB, `set $rcx = 1` is something that's supposed to work, but your change would break it. (Maybe just for tasks that were stuck in a system call, or maybe for any task even if it was stopped at a breakpoint if that return path gets used there.) — Peter Cordes, Jul 30 '21 at 05:58
