When a syscall is called by a userspace program, how does execution transfer back to kernelspace?

Question

I've been studying a lot about the ABI for x86-64, writing Assembly, and studying how the stack and heap work.

Given the following code:

#include <linux/seccomp.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // execute the seccomp syscall (could be any syscall)
    seccomp(...);

    return 0;
}

In Assembly for x86-64, this would do the following:

1. Align the stack pointer (as it's off by 8 bytes by default).
2. Set up registers and the stack for any arguments for the call to seccomp.
3. Execute the instruction call seccomp.
4. When seccomp returns, the C runtime will likely call exit(0), as far as I know.

I'd like to talk about what happens between step three and four above.

I currently have my stack for the currently running process with its own data in registers and on the stack. How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?

I believe I heard somewhere that syscalls don't happen immediately but on certain CPU ticks or interrupts. Is this true? How does this happen, for example, on Linux?

For a function call, you're correct. However, for a _system call_, no. The _function_ `seccomp()` will load up the arguments off the stack and into specific registers in a specific order, load the system call number for `seccomp` into `rax`, then execute the instruction `int 0x80`, `sysenter` or `syscall`. The CPU "traps" and hands control to the kernel, which executes an interrupt service routine that determines system call numbered `rax` was requested and performs it. The kernel reports the return value in `rax` when it returns control to the process. — Iwillnotexist Idonotexist, Mar 31 '16 at 03:38
@IwillnotexistIdonotexist Excellent reply, should be an answer! — Naftuli Kay, Mar 31 '16 at 03:45
Also @IwillnotexistIdonotexist, is there a delay between the syscall and the kernel beginning to execute or is the context switch pretty seamless? ie do I have to wait for a timer/tick to fire on the CPU for the kernel to wake up and attempt to do things or will it begin execution immediately at where the syscall was called? — Naftuli Kay, Mar 31 '16 at 03:50
@PeterCordes gave an excellent elaboration of my comment. And as he points out, the CPU doesn't wait to the next tick. If a CPU is interrupted, it services the interrupt as fast as possible (within a few hundred cycles due to context switching). The difference is that a program can voluntarily interrupt itself with a `syscall` instruction, or a timer can involuntarily interrupt the execution of a program to hand control to the kernel. The kernel sets up this timer for the express purpose of intervening at regular intervals to share CPU time slices amongst processes — Iwillnotexist Idonotexist, Mar 31 '16 at 17:17
Excellent. Thank you for your response. My next question might involve what context switching is :) — Naftuli Kay, Mar 31 '16 at 17:19
Ah, now that involves 1) Jumping to the ISR that handles syscalls 2) Saving all of the dozens of register values of the thread into a book-keeping structure and 3) Substituting them with the kernel thread's register values. Then afterwards the system call's implementation is invoked, runs, returns, and the reverse process is followed to switch back to user mode, except that `rax`'s value is deliberately replaced with the return code of the syscall. This is expected by the user-mode process and is part of the ABI (the contract or promise) between user and kernel modes. — Iwillnotexist Idonotexist, Mar 31 '16 at 17:34
@IwillnotexistIdonotexist: The user->kernel transition during a syscall does save user-space registers, but there's no separate kernel thread to switch to and restore registers from. The kernel entry-point code does modify RSP to point at this thread's kernel stack, but other registers are left holding the syscall args, or garbage, when the entry-point asm runs a CALL instruction to invoke the C syscall handling code. See [the source in `arch/x86/entry/entry_64.S`](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S#L105) — Peter Cordes, Nov 06 '16 at 13:13

score 7 · Accepted Answer · edited May 23 '17 at 12:01

syscalls don't happen immediately but on certain CPU ticks or interrupts

Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.

Note that glibc provides function wrappers around nearly every syscall, so if you look at disassembly you'll just see a normal-looking function call.

What really happens (x86-64 as an example):

See the AMD64 SysV ABI docs, linked from the x86 tag wiki. It specifies which registers to put args in, and that system calls are made with the syscall instruction. Intel's insn ref manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of how it was designed, I dug up some interesting mailing list posts from the amd64 mailing list between AMD architects and kernel devs. AMD updated the behaviour before the release of the first AMD64 hardware so it was actually usable for Linux (and other kernels).

32bit x86 uses the int 0x80 instruction for syscalls, or sysenter. Availability of the fast instructions depends on the mode and the CPU vendor: syscall is not available to 32bit code on Intel CPUs, and sysenter is not available to 64bit code on AMD CPUs, so 64bit Linux code always uses syscall. You can run int 0x80 in 64bit code, but you still get the 32bit ABI, which treats pointers as 32bit (i.e. don't do it). BTW, perhaps you were confused about syscalls having to wait for interrupts because of int 0x80? Running that instruction fires that interrupt on the spot, jumping right to the interrupt handler. 0x80 is not an interrupt that hardware can trigger, either, so that interrupt handler only ever runs after a software-triggered system call.

AMD64 syscall example:

#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h>    // for __NR_write

const char msg[]="hello world!\n";

ssize_t amd64_write(int fd, const char *msg, size_t len) {
  ssize_t ret;
  asm volatile("syscall"  // volatile: the syscall's side effect is needed even if the result is unused
               : "=a"(ret)                   // output: return value comes back in rax
               : [callnum]"a"(__NR_write),   // inputs: syscall number in rax,
                 "D"(fd), "S"(msg), "d"(len) // and args, in the same regs as the function calling convention
               : "rcx", "r11",               // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
                 "memory"                    // "memory" so stores into buffers happen in program order relative to the syscall
              );
  return ret;  // a value in [-4095, -1] is -errno; unlike the glibc wrapper, errno is not set
}

int main(int argc, char *argv[]) {
    amd64_write(1, msg, sizeof(msg)-1);
    return 0;
}

int glibcwrite(int argc, char**argv) {
    write(1, msg, sizeof(msg)-1);  // don't write the trailing zero byte
    return 0;
}

compiles to this asm output, with the godbolt Compiler Explorer:

gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.

.rodata
msg:
        .string "hello world!\n"

.text
main:   # using an in-line syscall
        mov     eax, 1    # __NR_write
        mov     edx, 13   # string length
        mov     esi, OFFSET FLAT:msg      # string pointer
        mov     edi, eax  # file descriptor = 1 happens to be the same as __NR_write
        syscall
        xor     eax, eax  # zero the return value
        ret

glibcwrite:  # using the normal way that you get from compiler output
        sub     rsp, 8       # keep the stack 16B-aligned for the function call
        mov     edx, 13      # put args in registers
        mov     esi, OFFSET FLAT:msg
        mov     edi, 1
        call    write
        xor     eax, eax
        add     rsp, 8
        ret

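glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno if the kernel reported an error. The compare at the top of the function tests a libc-internal flag (objdump can't resolve its name, so it's shown relative to an unrelated symbol) and branches to a slower path; as far as I can tell, in glibc builds of this era that path handles thread cancellation in multi-threaded processes. The wrapper does not restart the call on EINTR; the caller just sees -1 with errno set to EINTR.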

// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
   f7480:       83 3d f9 27 2d 00 00    cmp    DWORD PTR [rip+0x2d27f9],0x0        # 3c9c80 <argp_program_version_hook+0x1f8>
   f7487:       75 10                   jne    f7499 <__write+0x19>
   f7489:       b8 01 00 00 00          mov    eax,0x1
   f748e:       0f 05                   syscall
   f7490:       48 3d 01 f0 ff ff       cmp    rax,0xfffffffffffff001   // -4095: unsigned values at or above this are -errno codes
   f7496:       73 31                   jae    f74c9 <__write+0x49>
   f7498:       c3                      ret
   ... more code to handle cases where one of those branches was taken

score 5 · Answer 2 · answered Mar 31 '16 at 03:40

syscalls don't happen immediately but on certain CPU ticks or interrupts

Certainly the effect of your syscall could depend on many things, including ticks. For example, scheduler granularity or timer resolution could be limited to the tick period. But the call itself should happen "immediately" (inline with execution).

How does the userspace process turn over execution to the kernel? Does the kernel just pick up when the call is made and then push to and pop from the same stack?

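It probably varies slightly between architectures, but in general libc places the syscall arguments in registers and then executes a trap instruction (syscall, sysenter or int 0x80 on x86), which switches the CPU into kernel mode. The kernel does not push to and pop from the user stack: its entry code saves the user-space registers and switches the stack pointer to that thread's kernel stack.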

For additional details, see: "How system calls work on x86 linux"
