syscalls don't happen immediately but on certain CPU ticks or interrupts
Totally wrong. The CPU doesn't just sit there doing nothing until a timer interrupt. On most architectures, including x86-64, switching to kernel mode takes tens to hundreds of cycles, but not because the CPU is waiting for anything. It's just a slow operation.
Note that glibc provides function wrappers around nearly every syscall, so if you look at disassembly you'll just see a normal-looking function call.
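To make the wrapper point concrete, here is a minimal sketch (assuming Linux with glibc) that issues the same write request two ways: through the dedicated write() wrapper, and through glibc's generic syscall() function with an explicit call number. The helper names write_via_wrapper and write_via_syscall are made up for illustration. In both cases the compiler just emits an ordinary function call; the syscall instruction itself lives inside libc.

```c
#define _GNU_SOURCE        // exposes the syscall() declaration in <unistd.h>
#include <string.h>
#include <sys/syscall.h>   // SYS_write
#include <unistd.h>        // write(), syscall()

// Hypothetical helper: write a C string through glibc's dedicated wrapper.
ssize_t write_via_wrapper(int fd, const char *s) {
    return write(fd, s, strlen(s));
}

// Hypothetical helper: same request through the generic syscall() function,
// passing the call number explicitly. Like write(), it returns -1 and sets
// errno on failure.
ssize_t write_via_syscall(int fd, const char *s) {
    return syscall(SYS_write, fd, s, strlen(s));
}
```

Calling either helper with fd 1 and "hello\n" should return 6 after printing the string; neither call waits for a timer tick.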
What really happens (x86-64 as an example):
See the AMD64 SysV ABI docs, linked from the x86 tag wiki. The ABI specifies which registers to put args in, and that system calls are made with the syscall instruction. Intel's insn ref manual (also linked from the tag wiki) documents in full detail every change that syscall makes to the architectural state of the CPU. If you're interested in the history of how it was designed, I dug up some interesting mailing list posts from the amd64 mailing list between AMD architects and kernel devs. AMD updated the behaviour before the release of the first AMD64 hardware so it was actually usable for Linux (and other kernels).
32bit x86 uses the int 0x80 instruction for syscalls, or sysenter. syscall isn't available in 32bit mode, and sysenter isn't available in 64bit mode. You can run int 0x80 in 64bit code, but you still get the 32bit API that treats pointers as 32bit (i.e. don't do it). BTW, perhaps you were confused about syscalls having to wait for interrupts because of int 0x80? Running that instruction fires that interrupt on the spot, jumping right to the interrupt handler. 0x80 is not an interrupt that hardware can trigger, either, so that interrupt handler only ever runs after a software-triggered system call.
AMD64 syscall example:
#include <stdlib.h>
#include <unistd.h>
#include <linux/unistd.h> // for __NR_write
const char msg[]="hello world!\n";
ssize_t amd64_write(int fd, const char *msg, size_t len) {
    ssize_t ret;
    asm volatile("syscall"  // volatile because we still need the side-effect of making the syscall even if the result is unused
        : "=a"(ret)                     // outputs
        : [callnum]"a"(__NR_write),     // inputs: syscall number in rax,
          "D"(fd), "S"(msg), "d"(len)   // and args, in same regs as the function calling convention
        : "rcx", "r11",                 // clobbers: syscall always destroys rcx/r11, but Linux preserves all other regs
          "memory"                      // "memory" to make sure any stores into buffers happen in program order relative to the syscall
    );
    return ret;  // raw kernel result: byte count, or -errno on failure (errno is not set)
}
int main(int argc, char *argv[]) {
    amd64_write(1, msg, sizeof(msg)-1);
    return 0;
}
int glibcwrite(int argc, char **argv) {
    write(1, msg, sizeof(msg)-1);  // don't write the trailing zero byte
    return 0;
}
This compiles to the following asm output, on the godbolt Compiler Explorer. gcc's -masm=intel output is somewhat MASM-like, in that it uses the OFFSET keyword to get the address of a label.
.rodata
msg:
.string "hello world!\n"
.text
main: // using an in-line syscall
mov eax, 1 # __NR_write
mov edx, 13 # string length
mov esi, OFFSET FLAT:msg # string pointer
mov edi, eax # file descriptor = 1 happens to be the same as __NR_write
syscall
xor eax, eax # zero the return value
ret
glibcwrite: // using the normal way that you get from compiler output
sub rsp, 8 // keep the stack 16B-aligned for the function call
mov edx, 13 // put args in registers
mov esi, OFFSET FLAT:msg
mov edi, 1
call write
xor eax, eax
add rsp, 8
ret
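glibc's write wrapper function just puts 1 in eax and runs syscall, then checks the return value and sets errno if it was an error. It doesn't retry on EINTR itself. As far as I can tell, the compare at the top checks whether the process is multi-threaded, so a slower path can treat write as a pthread cancellation point.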
// objdump -R -Mintel -d /lib/x86_64-linux-gnu/libc.so.6
...
00000000000f7480 <__write>:
f7480: 83 3d f9 27 2d 00 00 cmp DWORD PTR [rip+0x2d27f9],0x0 # 3c9c80 <argp_program_version_hook+0x1f8>
f7487: 75 10 jne f7499 <__write+0x19>
f7489: b8 01 00 00 00 mov eax,0x1
f748e: 0f 05 syscall
f7490: 48 3d 01 f0 ff ff cmp rax,0xfffffffffffff001 // -4095: unsigned >= this means rax holds -errno (-EINTR would be -4)
f7496: 73 31 jae f74c9 <__write+0x49>
f7498: c3 ret
... more code to handle cases where one of those branches was taken
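The cmp rax,0xfffffffffffff001 / jae pair in that listing is the usual Linux error check: raw return values in the range [-4095, -1] are negated errno codes, and anything else is a successful result. Here is a rough sketch of the same logic in C. The function name raw_syscall_result is made up for illustration; it is not glibc's actual code, only the same comparison written out.

```c
#include <errno.h>

// Hypothetical helper: convert a raw Linux kernel return value into the
// libc convention. Values in [-4095, -1] are negated errno codes; anything
// else is passed through unchanged as a successful result.
long raw_syscall_result(long raw) {
    // Unsigned compare against -4095, matching the cmp/jae in the disassembly.
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, a raw result of 13 passes through as 13, while a raw result of -4 becomes -1 with errno set to EINTR.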