2

I'm trying to build up a "big picture" of how things work in the Linux kernel and userspace, and I'm quite confused. I know that userspace makes use of system calls to "talk" to the kernel, but I don't know how. I tried to read the C library and kernel source code, but they are complex and not easy to understand. I've also read several books on operating system concepts, like managing processes, memory, and devices, but they don't make the "transition" (userspace->kernel) clear. So, where exactly does the transition between userspace and kernel space happen? How does the C library run code that's inside the Linux kernel running on the machine?

To make an analogy: imagine that there is a house. The house is locked. The key to open the house is inside the house itself. There's only one person inside the house, the kernel. The userspace is someone trying to enter the house. My question would be: how does the kernel know there's someone outside the house wanting the key, and which mechanism allows the house to be opened with that key?

nowat
  • They use the doorbell. This question is platform dependent. On `x86-64`, the kernel sets up the processor to dispatch `syscall`s to the kernel. `syscall` is an assembly instruction, which in the analogy acts as the doorbell. – Magnus Hoff Apr 29 '15 at 12:10
  • What do you mean by "the kernel sets up the processor to dispatch syscall to the kernel"? – nowat Apr 29 '15 at 12:12
  • In general, the kernel configures the processor by writing values to specific registers and places in memory. This is reasonable to do as soon as the kernel is loaded at boot. This information is used by the processor to handle the `syscall` instruction and lots of other stuff the way the kernel intended. – Magnus Hoff Apr 29 '15 at 12:19
  • A [related question on *syscalls and the 'C' library](http://stackoverflow.com/questions/572942/whats-the-difference-between-c-system-calls-and-c-library-routines). You may also look at [some](http://git.kernel.org/cgit/libs/klibc/klibc.git/tree/usr/klibc/arch) [source](https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps;hb=HEAD). As per the answers below, the calls will not be pure 'C' but will at least need some compiler extensions or assembler. **Specifics** are 'platform dependent', but access protection between userspace and kernel must provide a `syscall` analogue. – artless noise Apr 30 '15 at 14:08

2 Answers

12

That's quite easy: the person can use the doorbell to let the kernel know they are waiting outside. The doorbell in our case is usually a special CPU exception, a software interrupt, or a dedicated instruction that a user-space application is allowed to use and the kernel can handle.

So the procedure is like this:

  1. First you need to know the system call number. Each syscall has a unique number, and there is a table inside the kernel that maps those numbers to specific functions. The table is architecture-specific: on two different architectures the same number may map to different syscalls.

  2. Then you set up your arguments. This is also architecture-specific but is not much different from passing arguments in ordinary function calls. Usually, you put your arguments in specific CPU registers. Which ones is described in the ABI of the architecture.

  3. Then you enter the syscall. Depending on the architecture, this may mean causing an exception or executing a dedicated CPU instruction.

  4. The kernel has a special handler function that runs in kernel mode when a syscall is made. It saves the user-space register state of the calling process (this is often loosely called a context switch, although strictly it is a switch of privilege mode; a full context switch only happens if the kernel then schedules a different process), reads the syscall number and arguments, and calls the proper syscall routine. It also makes sure to put the return value in the proper place for user-space to read, and to resume the process when the syscall routine is done (restoring its saved state).
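The four steps above can be sketched from the user-space side with glibc's generic `syscall()` wrapper, which takes the number and the arguments and does the register setup and kernel entry for you. This is only an illustration, assuming Linux with glibc; the helper names `raw_write_stdout` and `raw_getpid` are made up for this example.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>  /* SYS_write, SYS_getpid: the syscall numbers */
#include <unistd.h>       /* syscall(), STDOUT_FILENO */

/* Hypothetical helper: write a string to stdout without calling write(). */
static long raw_write_stdout(const char *msg)
{
    /* Step 1: pick the number (SYS_write).
     * Step 2: supply the arguments (fd, buffer, length).
     * Step 3: syscall() loads them into the registers the ABI asks for
     *         and executes the architecture's kernel-entry instruction.
     * Step 4 happens in the kernel; its return value comes back here. */
    return syscall(SYS_write, STDOUT_FILENO, msg, strlen(msg));
}

/* Hypothetical helper: ask the kernel for our process ID directly. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling `raw_write_stdout("hello\n")` should print the text and return the number of bytes written; on failure `syscall()` returns -1 and sets `errno`, like other libc wrappers.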

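As an example, to make a system call on x86_64 you use the syscall instruction with the syscall number in the %rax register. Arguments are passed in the registers %rdi, %rsi, %rdx, %r10, %r8 and %r9. (%r10 takes the place of %rcx from the ordinary function-call convention, because the syscall instruction itself overwrites %rcx and %r11.) The return value comes back in %rax.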
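To make the x86_64 case concrete, here is a minimal sketch that rings the doorbell by hand with GCC/Clang inline assembly instead of going through libc. It assumes x86_64 Linux, where the entry instruction is `syscall`; the name `raw_syscall3` is made up for this example, and on any other target the sketch simply falls back to libc's `syscall()` so that it still compiles.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_write, SYS_getpid */
#include <unistd.h>       /* syscall(), used only by the fallback */

/* Hypothetical helper: issue a system call with up to three arguments. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* Number in rax, arguments in rdi, rsi, rdx. The instruction
     * clobbers rcx and r11, and the kernel may touch memory. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;  /* on failure this is a negative errno value */
#else
    /* Assumption: other targets just use the libc wrapper, which
     * returns -1 and sets errno on failure instead. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For instance, `raw_syscall3(SYS_write, 1, (long)"ok\n", 3)` writes three bytes to stdout. Unlike the libc wrappers, the raw instruction reports errors as a negative number in %rax; turning that into `errno` is one of the things the C library does for you.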

You could also use the older way that comes from 32-bit x86 CPUs: software interrupt number 0x80 (the int 0x80 instruction). In that case the syscall number goes in the %eax register and the arguments go in %ebx, %ecx, %edx, %esi, %edi and %ebp.

ARM (32-bit, EABI) is very similar: you use the "supervisor call" instruction (SVC #0). The syscall number goes in the r7 register, the arguments go in registers r0-r6, and the return value of the syscall is stored in r0.

Other architectures and operating systems use similar techniques. The details may vary: software interrupt numbers may be different, and arguments may be passed in different registers or even on the stack, but the core idea is the same.

ecm
Krzysztof Adamski
  • Are these registers virtual or physical? (the kernel creates virtual memory for each process, are the registers also virtual? or the C library writes straight into the processor when setting up the system call?) – nowat Apr 29 '15 at 12:41
  • @yuri: I don't know what you mean by virtual register. Userspace always uses real CPU registers. There may be more than one process in the system, but only one can run on each CPU at a time. If this process is paused, a context switch happens and all the register values are saved to memory. They are restored when the process is scheduled again. That's how registers are never overwritten by a different process even though they all share them. – Krzysztof Adamski Apr 29 '15 at 12:54
1

Many processors have an instruction to trigger a specific "trap" or "interrupt". The Linux kernel sets up such a "trap" or "interrupt" specifically for system calls.

The library sets up the processor registers in a certain way and then executes the special trap or interrupt instruction. This causes the processor to enter privileged mode and call the kernel's trap/interrupt handler function, which decodes the values in the registers and calls the appropriate function to handle the system call.
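The "decode the registers, then call the appropriate function" step can be modelled in plain user-space C. The sketch below is a toy, not kernel code: the `toy_*` names, the two fake "system calls", and the register layout are all invented for illustration. It only shows the shape of a number-indexed dispatch table with a bounds check.

```c
#include <errno.h>   /* ENOSYS */
#include <stddef.h>  /* size_t */

/* Toy stand-in for the saved user registers (hypothetical layout). */
struct toy_regs {
    long nr;       /* system call number, as placed by the library */
    long args[3];  /* up to three arguments */
    long ret;      /* where the handler leaves the return value */
};

typedef long (*toy_syscall_fn)(long, long, long);

/* Two made-up "system calls" so the table has something to point at. */
static long toy_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_neg(long a, long b, long c) { (void)b; (void)c; return -a; }

/* The table that maps numbers to functions: 0 -> toy_add, 1 -> toy_neg. */
static const toy_syscall_fn toy_table[] = { toy_add, toy_neg };
#define TOY_NR_SYSCALLS (sizeof toy_table / sizeof toy_table[0])

/* Toy trap handler: validate the number, dispatch, store the result. */
static void toy_trap_handler(struct toy_regs *regs)
{
    if (regs->nr < 0 || (size_t)regs->nr >= TOY_NR_SYSCALLS) {
        regs->ret = -ENOSYS;  /* unknown number: "function not implemented" */
        return;
    }
    regs->ret = toy_table[regs->nr](regs->args[0], regs->args[1], regs->args[2]);
}
```

Filling a `struct toy_regs` with `nr = 0` and arguments 2 and 3, then calling `toy_trap_handler`, leaves 5 in `ret`. The bounds check matters because the number comes from unprivileged code and cannot be trusted.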

That is the most common way, and basically how it's done for just about all systems that need isolation between kernel and user-space.

ecm
Some programmer dude