8

How does Linux determine the address of another process to execute with a syscall, as in this example?

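mov rax, 59         ; system call number 59 = execve on x86-64 Linux
mov rdi, progName   ; first argument: pointer to the program path
                    ; (argv in RSI and envp in RDX are not set in this example)
syscall             ; enter the kernel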

There seems to be some confusion about my question. To clarify, what I was asking is how syscall works, independently of the registers or arguments passed: how it knows where to jump, where to return, etc. when another process is called.

  • 3
    It's unclear which part you have trouble with. `syscall` transfers control to the kernel at the entry point determined by the `IA32_LSTAR` MSR. The OS then loads the new program and invokes its entry point as read from the file header (for ELF files anyway). The new process lives in its own virtual address space, the layout is again determined by the headers and the loader along with ASLR. Physical memory is allocated by the OS dynamically. – Jester Jul 02 '19 at 14:12
  • So with a syscall the kernel takes control, if we are doing a print with a syscall for example, how it knows where to return when it ends? – Bryan Jimenez Chacon Jul 02 '19 at 14:27
  • 5
    `syscall` saves the return address to RCX. See [this](https://www.felixcloutier.com/x86/syscall) – Hasturkun Jul 02 '19 at 14:31
  • 1
    it hits an entry point in the kernel, and that code does some form of if-then else or look up table to determine which syscall is happening that leads to the right code/function. – old_timer Jul 02 '19 at 15:22
  • Possible duplicate of [What are the calling conventions for UNIX & Linux system calls on i386 and x86-64](https://stackoverflow.com/q/2535989/608639). More generally, [how does 64-bit linux syscall work site:stackoverflow.com](https://duckduckgo.com/?q=how+does+64-bit+linux+syscall+work+site:stackoverflow.com). – jww Jul 02 '19 at 17:56
  • https://www.felixcloutier.com/x86/syscall – Solomon Ucko Dec 11 '20 at 02:12

1 Answer

10

syscall

The syscall instruction is really just an Intel/AMD CPU instruction. Here is the synopsis:

IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
    THEN #UD;
FI;
RCX ← RIP; (* Will contain address of next instruction *)
RIP ← IA32_LSTAR;
R11 ← RFLAGS;
RFLAGS ← RFLAGS AND NOT(IA32_FMASK);
CS.Selector ← IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base ← 0;
        (* Flat segment *)
CS.Limit ← FFFFFH;
        (* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type ← 11;
        (* Execute/read code, accessed *)
CS.S ← 1;
CS.DPL ← 0;
CS.P ← 1;
CS.L ← 1;
        (* Entry is to 64-bit mode *)
CS.D ← 0;
        (* Required if CS.L = 1 *)
CS.G ← 1;
        (* 4-KByte granularity *)
CPL ← 0;
SS.Selector ← IA32_STAR[47:32] + 8;
        (* SS just above CS *)
(* Set rest of SS to a fixed value *)
SS.Base ← 0;
        (* Flat segment *)
SS.Limit ← FFFFFH;
        (* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type ← 3;
        (* Read/write data, accessed *)
SS.S ← 1;
SS.DPL ← 0;
SS.P ← 1;
SS.B ← 1;
        (* 32-bit stack segment *)
SS.G ← 1;
        (* 4-KByte granularity *)

The most important part is the pair of operations that save and replace the RIP register:

RCX ← RIP
RIP ← IA32_LSTAR

In other words, there must be code at the address stored in IA32_LSTAR (a model-specific register that the kernel sets up), and RCX holds the return address.

The CS and SS segments are also set up so that the kernel code runs at privilege level 0 (the most privileged level).

The #UD (invalid-opcode) exception is raised if the CPU is not in 64-bit mode, or if the operating system has not enabled SYSCALL/SYSRET through the SCE bit of IA32_EFER.
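To make the register effects concrete, here is a minimal sketch in C using GCC/Clang inline assembly. It is an illustration only: the helper name `raw_syscall3` is hypothetical, and the fallback branch for other architectures is an assumption added so the snippet builds elsewhere. The clobber list names RCX and R11 because, as the synopsis shows, the instruction overwrites them with the return address and the saved RFLAGS.

```c
#define _DEFAULT_SOURCE    /* expose syscall() even in strict ISO C mode */
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid and the other system call numbers */
#include <unistd.h>        /* syscall(): libc wrapper used as a fallback */

/* Hypothetical helper: make a system call with up to three arguments. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                            /* result comes back in RAX */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)   /* RAX, RDI, RSI, RDX */
        : "rcx", "r11", "memory");             /* RCX <- RIP, R11 <- RFLAGS */
    return ret;
#else
    /* Assumption: on other targets, defer to libc
     * (which returns -1 and sets errno on error). */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For instance, `raw_syscall3(SYS_getpid, 0, 0, 0)` should return the same value as `getpid()`.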

How is RAX interpreted?

RAX is simply an index into a table of kernel function pointers. The kernel first does a bounds check (returning -ENOSYS if RAX > __NR_syscall_max), then dispatches to, in C syntax, sys_call_table[rax](rdi, rsi, rdx, r10, r8, r9);

; Intel-syntax translation of Linux 4.12 syscall entry point
       ...                 ; save user-space registers etc.
    call   [sys_call_table + rax * 8]       ; dispatch to sys_execve() or whatever kernel C function

;;; execve probably won't return via this path, but most other calls will
       ...                 ; restore registers except RAX return value, and return to user-space

Modern Linux is more complicated in practice because of workarounds for x86 vulnerabilities like Meltdown and L1TF, which change the page tables so that most of kernel memory isn't mapped while user-space is running. The above code is a literal translation (from AT&T syntax) of call *sys_call_table(, %rax, 8) from ENTRY(entry_SYSCALL_64) in Linux 4.12 arch/x86/entry/entry_64.S (before Spectre/Meltdown mitigations were added). Also related: "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?" has some more details about the kernel side of system-call dispatching.
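The bounds check and table dispatch can be modelled in plain user-space C. This is only a toy sketch: `toy_call_table`, `toy_dispatch`, and the two stand-in "system calls" are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <errno.h>   /* ENOSYS */
#include <stddef.h>  /* size_t */

/* Every entry takes six arguments, like the register set
 * rdi, rsi, rdx, r10, r8, r9. */
typedef long (*syscall_fn)(long, long, long, long, long, long);

/* Toy stand-ins for kernel functions such as sys_execve(). */
static long toy_add(long a, long b, long c, long d, long e, long f)
{
    (void)c; (void)d; (void)e; (void)f;
    return a + b;
}

static long toy_negate(long a, long b, long c, long d, long e, long f)
{
    (void)b; (void)c; (void)d; (void)e; (void)f;
    return -a;
}

static const syscall_fn toy_call_table[] = { toy_add, toy_negate };

/* Highest valid index, playing the role of __NR_syscall_max. */
#define TOY_NR_MAX (sizeof toy_call_table / sizeof toy_call_table[0] - 1)

/* Bounds-check the number, then call through the table. */
static long toy_dispatch(size_t nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    if (nr > TOY_NR_MAX)
        return -ENOSYS;
    return toy_call_table[nr](a1, a2, a3, a4, a5, a6);
}
```

With this table, `toy_dispatch(0, 2, 3, 0, 0, 0, 0)` calls `toy_add`, while any number above 1 yields `-ENOSYS` without touching the table.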

Fast?

The instruction is said to be fast because the older mechanism was a software interrupt such as int 0x80. An interrupt switches to the kernel stack, pushes the interrupted state (stack pointer, flags, and return address) onto it, and later uses the rather slow IRET to leave the exception state and resume at the address just after the interrupt instruction. This is generally much slower.

With syscall you can avoid most of that overhead. However, for what you're asking about, this does not really matter.

Another instruction used alongside syscall is swapgs. syscall itself does not modify RSP, so the kernel's entry code uses swapgs to switch to its own GS base, which gives it a way to find its own data and its kernel stack. See the Intel/AMD documentation on those instructions for more details.

New Process?

The Linux system has what it calls a task table. Each process and each thread within a process is actually called a task.

When you create a new process, Linux creates a task. For that to work, it runs code that does things such as:

  • Make sure the executable exists
  • Set up a new task (including parsing the ELF program headers from that executable to create memory mappings in the newly created virtual address space)
  • Allocate a stack
  • Load the first few blocks of the executable (as an optimization for demand paging), allocating some physical pages for the virtual pages to map to
  • Set the start address in the task (the ELF entry point from the executable)
  • Mark the task as ready to run (runnable)

This is, of course, super simplified.
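Seen from user space, the usual way to trigger those steps is `fork()` followed by `execve()`. The sketch below assumes a POSIX system; `run_program` is a hypothetical helper name. The point it illustrates is that `execve()` does not return on success: the kernel replaces the child's program image and resumes it at the new program's entry point.

```c
#define _DEFAULT_SOURCE   /* expose POSIX declarations in strict ISO C mode */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;    /* environment of the current process */

/* Hypothetical helper: run `path` with `argv` in a child process and
 * return its exit status, or -1 if something went wrong. */
static int run_program(const char *path, char *const argv[])
{
    pid_t pid = fork();               /* the kernel creates a new task */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execve(path, argv, environ);  /* system call 59 on x86-64 */
        _exit(127);                   /* reached only if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `/bin/sh` and the arguments `sh -c "exit 7"` should hand back the shell's exit status, 7.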

The start address is defined in your ELF binary. The kernel really only needs to determine that one address, save it as the task's RIP, and "return" to user-space. The normal demand-paging mechanism takes care of the rest: if the code is not yet loaded, the access generates a #PF page-fault exception and the kernel loads the necessary code at that point. In most cases, though, the loader will already have loaded part of the program as an optimization to avoid that initial page fault.
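The entry point is an ordinary field (`e_entry`) in the ELF file header, so it can be read with a few lines of C. This is a minimal sketch that assumes a 64-bit ELF file in the host's byte order and the `<elf.h>` header shipped with Linux C libraries; `elf64_entry_point` is a hypothetical helper name. For position-independent executables the value is relative to the load base chosen at run time.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Ehdr, ELFMAG, SELFMAG, EI_CLASS, ELFCLASS64 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read e_entry from a 64-bit ELF file.
 * Returns 0 on success, -1 if the file is unreadable or not ELF64. */
static int elf64_entry_point(const char *path, uint64_t *entry)
{
    Elf64_Ehdr hdr;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t n = fread(&hdr, 1, sizeof hdr, f);
    fclose(f);
    if (n != sizeof hdr)
        return -1;

    /* Check the "\177ELF" magic and the 64-bit class byte. */
    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0 ||
        hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    *entry = hdr.e_entry;   /* virtual address where execution starts */
    return 0;
}
```

On Linux, passing `/proc/self/exe` reads the header of the running program itself.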

(A #PF on a page that isn't mapped would result in the kernel delivering a SIGSEGV segfault signal to your process, but a "valid" page fault is handled silently by the kernel.)
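Valid page faults can be observed from user space by counting them. The sketch below is an illustration under stated assumptions: it relies on `getrusage()` reporting minor faults in `ru_minflt` (as Linux does), and `faults_while_touching` is a hypothetical helper name. It maps anonymous memory, touches each page for the first time, and reports how many minor faults the process took meanwhile; no signal is delivered for any of them.

```c
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS in strict ISO C mode */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: map `pages` anonymous pages, write to each one,
 * and return the number of minor page faults taken while doing so
 * (-1 on error). */
static long faults_while_touching(size_t pages)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page_size;
    struct rusage before, after;

    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < pages; i++)
        mem[i * page_size] = 1;   /* first touch may raise a valid #PF */
    getrusage(RUSAGE_SELF, &after);

    munmap(mem, len);
    return after.ru_minflt - before.ru_minflt;
}
```

The exact count depends on the kernel and its configuration (huge pages, for example, reduce it), so treat the result as indicative only.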

All new processes usually get loaded at the same virtual address (ignoring PIE + ASLR). This is possible because we use the MMU (Memory Management Unit). That coprocessor translates memory addresses between virtual address spaces and physical address space.

(Editor's note: the MMU isn't really a coprocessor; in modern CPUs virtual memory logic is tightly integrated into each core, alongside the L1 instruction/data caches. Some ancient CPUs did use an external MMU chip, though.)

Determine the Address?

So, now we understand that all processes are loaded at the same virtual address (0x400000 is the default chosen by ld under Linux). The MMU translates that address to the real physical address. How does the kernel decide on that physical address? Well, it has a memory allocation function. It's that simple.

It calls a "malloc()"-type function that searches for a memory block that is not currently in use and creates (a.k.a. loads) the process at that location. If no memory block is available, the kernel tries to swap something out of memory. If that fails, the creation of the process fails.

When creating a process, the kernel allocates fairly large blocks of memory to start with. It is not unusual to allocate 1 MB or 2 MB buffers to start a new process. This makes things go a lot faster.

Also, if the program is already running and you start it again, much of the memory used by the running instance can be reused. In that case the kernel does not allocate or load those parts again; it uses the MMU to share the pages that can be common to both instances. In most cases the code can be shared because it is read-only, and data marked read-only can be shared as well. Data that is not marked read-only can still be shared as long as it has not been modified; in that case it is marked copy-on-write.
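Copy-on-write is easy to observe with `fork()`, which uses the same mechanism between a parent and its child (this swaps in `fork()` for "starting the program twice", purely to keep the example short). In this sketch, `parent_value_after_child_write` is a hypothetical helper: the child writes to a global variable, which gives it a private copy of that page, and the parent's copy stays unchanged even though both use the same virtual address.

```c
#define _DEFAULT_SOURCE   /* expose POSIX declarations in strict ISO C mode */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_until_written = 1;

/* Hypothetical helper: the child overwrites the global, then the parent
 * re-reads it. Returns the value the parent sees, or -1 on error. */
static int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        shared_until_written = 2;   /* the child gets its own copy of the page */
        _exit(shared_until_written == 2 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return shared_until_written;    /* the parent's copy is untouched */
}
```

The parent still reads 1 after the child has exited, because the write went to the child's private copy of the page.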

Alexis Wilke
  • In your "new process" section, the ELF program loader does have to parse the ELF program headers and set up all the mappings. It doesn't have to put those into the HW page tables, though. But SEGFAULT is a POSIX signal that's only delivered on an *invalid* page-fault. You're talking about demand-paging in response to a hardware `#PF` exception. When the page-fault is *valid* (on a page that is supposed to be mapped), the kernel handles it silently without delivering SIGSEGV. Also, I have no idea why an answer about kernel syscall entry points has a section about demand paging. – Peter Cordes Jul 02 '19 at 20:11
  • You claim that `CS` and `SS` have something to do with *physical* addresses. No, segment->linear happens before virt->phys translation, not after. Kernel addresses are virtual, including `IA32_LSTAR`, and must be mapped in the page tables. That's why page tables have a U/S bit to protect kernel pages from user-space. (modulo Meltdown). `syscall` itself doesn't modify CR3, so the syscall entry point at least needs to be mapped. – Peter Cordes Jul 02 '19 at 20:16
  • 1
    I'm surprised you don't mention `swapgs`. *That's* how the kernel is intended to find the kernel stack, and from there more saved kernel data. (Note that `syscall` does *not* modify RSP). `swapgs` is the mechanism Linux uses. – Peter Cordes Jul 02 '19 at 20:18
  • 1
    Oh, I read the question more carefully, and it does ask something about starting a new process and maybe about virtual addresses in the new process. So that's why there's a section about that. – Peter Cordes Jul 02 '19 at 21:06
  • @PeterCordes, yes, the question was probably more about how a process gets loaded than the `syscall` instruction itself. I added a small section about `EAX` as well. Thanks for you edits! – Alexis Wilke Jul 02 '19 at 22:22
  • "and uses the rather slow `RTE` to exit the exception state" Never heard of an `rte` instruction, did you mean `iret`? – ecm Feb 02 '23 at 14:54