Are Linux system calls executed inside an exception handler?

Question

I understand that after entering a system call with e.g. the syscall or int 0x80 (x86/x86-64) or svc (ARM) instruction, we stay in the calling process's context (but switch from user to kernel mode) from the Linux kernel's point of view. However, from the hardware's point of view, we jump into a syscall/svc/... exception handler. Is the whole system call code executed inside the exception handler in Linux?

In a certain sense, yes. But I'm not sure it's useful to think of it as being "inside the handler"; rather that the interrupt / exception / system call handling mechanism was used as a way to transition between unprivileged and privileged code. — Nate Eldredge, Apr 20 '21 at 22:29

score 4 · Accepted Answer · answered Apr 21 '21 at 00:14

Using the terminology that's common for 80x86 (from Intel's manuals, etc.), the CPU has a "current privilege level" (CPL) that determines whether code is restricted or not (e.g. whether privileged instructions are permitted), and this is the basis of "user space vs. kernel space". The things that trigger a switch from CPL=3 ("user space") to CPL=0 ("kernel space") are:

exceptions, which typically indicate that a problem (e.g. division by zero) was detected by the CPU
IRQs, which indicate that a device needs attention
software interrupts, call gates, and the syscall and sysenter instructions. These are all different ways for software to explicitly ask the OS/kernel for something (kernel system calls), and a given OS/kernel may support only some or one of them (64-bit code only needs syscall, and the other alternatives probably won't be supported unless the OS/kernel is trying to provide backward compatibility for obsolete 32-bit code)
task gates (obsolete, not supported for 64-bit and not used by any well-known 32-bit OS)
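To make the CPL idea concrete, here is a minimal sketch (assuming GCC or Clang inline assembly on x86 or x86-64; the function name `current_privilege_level` is made up for illustration). It reads the CS segment register, which is not a privileged operation, and returns its low two bits, which hold the current privilege level.

```c
#include <stdint.h>

/* Hypothetical helper: returns the current privilege level (0..3) on
 * x86/x86-64, or -1 on architectures where this sketch does not apply. */
int current_privilege_level(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint16_t cs;
    /* Reading CS is allowed at any privilege level. */
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    /* The low two bits of the CS selector hold the CPL. */
    return cs & 3;
#else
    return -1;
#endif
}
```

Called from an ordinary user-space process on x86, this should return 3; code running inside the kernel would see 0.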

Using this terminology, it'd be wrong to say that Linux system calls are executed in an exception handler (because an exception is something specific that isn't involved).

However...

Different people define terminology differently, and some people (ARM) define "exception" as a synonym for "anything that causes a switch to kernel space". This makes some sense for CPU designers, who are primarily focused on the impact that any switch to supervisor mode has on the CPU and have little reason to care about the differences (because the differences are mostly a software developer's problem). For everyone else (software developers), using that terminology you could say that everything in the kernel is executed inside an exception handler, which mostly makes the word "exception" meaningless (because "could be anything at all" doesn't provide any additional information). In other words, using that terminology, "Linux system calls are executed inside an exception handler" is technically correct but could be shortened to "Linux system calls are executed" without changing the statement's meaning.

Note: Recently Intel published a draft proposal for a possible future extension that would (if adopted, supported by the CPU and enabled by the OS) replace all of the above with a new "events" scheme, where many separate handlers (exception, IRQ, system call, ...) are replaced by a single "event handler" (which would have to fetch an "event reason" provided by the CPU and then branch to "event reason specific" code). If that happens I'd expect a third set of terminology (e.g. "exception event", "IRQ event" and "system call event"), where all of the kernel's code is executed in the context of some kind of event, and where "Linux system calls are executed inside an event handler" would be technically correct but could be shortened to "Linux system calls are executed".
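The single-entry-point idea can be sketched in plain C. This is only an illustration of the dispatch shape described in the note, not the proposal's actual interface: the enum values, function names and returned strings are all hypothetical.

```c
/* Hypothetical "event reason" values; a real CPU would supply its own encoding. */
enum event_reason { EVENT_EXCEPTION, EVENT_IRQ, EVENT_SYSCALL };

/* Hypothetical reason-specific handlers; each just names the path taken. */
static const char *handle_exception(void) { return "exception path"; }
static const char *handle_irq(void)       { return "IRQ path"; }
static const char *handle_syscall(void)   { return "system call path"; }

/* One entry point for everything: look at the reason, then branch. */
const char *dispatch_event(enum event_reason reason)
{
    switch (reason) {
    case EVENT_EXCEPTION: return handle_exception();
    case EVENT_IRQ:       return handle_irq();
    case EVENT_SYSCALL:   return handle_syscall();
    }
    return "unknown event";
}
```

With this shape, a system call is just one more case in the switch, which is why "executed inside an event handler" would be true of all kernel code.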

Peter Cordes · Answer 2 · 2021-04-21T18:54:06.203

No. Most importantly, syscall / sysenter are neither an exception nor an interrupt at all; see below.

But also, "interrupts" (including software interrupts like int 0x80) are different from "exceptions" (events caused by error conditions) in Intel terminology.

For an "exception" such as a #PF page fault, the saved RIP is the address of the faulting instruction, so returning to user-space with iret will retry that instruction. That is what you want after adjusting the page tables for a valid page fault, as opposed to one that will result in the kernel delivering a SIGSEGV. Also, some exceptions will push an error code along with RFLAGS and CS:RIP.
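The "retry the faulting instruction" behaviour can be observed from user space. The following is a rough sketch, assuming Linux with glibc; the names `faults_for_one_store` and `on_segv` are invented for this example. It maps a page with no access rights, stores to it, and lets a SIGSEGV handler make the page writable. Because the saved instruction pointer refers to the faulting store, returning from the handler re-runs that store. Note that mprotect is not on the POSIX list of async-signal-safe functions, so this is a demonstration rather than a pattern to copy.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t fault_count;
static void *page;
static size_t page_len;

/* On SIGSEGV: count the fault and make the page accessible. */
static void on_segv(int sig)
{
    (void)sig;
    fault_count++;
    /* Bail out instead of looping forever if the fix-up does not work. */
    if (fault_count > 1 ||
        mprotect(page, page_len, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
}

/* Returns how many faults one store to a PROT_NONE page took,
 * or -1 if setup failed or the store did not land. */
int faults_for_one_store(void)
{
    struct sigaction sa, old;

    page_len = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_len, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(page, page_len);
        return -1;
    }

    fault_count = 0;
    *(volatile char *)page = 42;          /* faults, then is retried */
    int stored = *(volatile char *)page;  /* page is now readable */

    sigaction(SIGSEGV, &old, NULL);
    munmap(page, page_len);
    return stored == 42 ? (int)fault_count : -1;
}
```

The function should report a single fault: the store traps once, the handler fixes the mapping, and the same instruction then completes without the program adjusting any saved context.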

A software interrupt like int 0x80 produces a saved EIP/RIP of the instruction after, so iret will continue instead of re-running the same instruction, without the kernel having to manually modify the saved context. So it's quite similar to an exception in that it pushes the RFLAGS and a CS:RIP onto the stack and jumps to a CS:RIP address loaded from the IDT, but it differs in exactly what saved-RIP value is pushed. Either way code executes at privilege-level (ring) 0, but that saved-RIP = instruction after the trapping one lets it conveniently be used as a remote procedure call (from user-space into the kernel).

(Semi-related: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? shows some of the kernel side of the syscall and int 0x80 handlers in a 64-bit Linux kernel, from before the changes for Meltdown / Spectre mitigation made things more complicated.)

And of course syscall doesn't use the interrupt / exception mechanism at all (no IDT, nothing pushed on the kernel stack). Instead it uses RCX and R11 to save the user-space RIP and RFLAGS, and sets RIP from the IA32_LSTAR MSR (which the kernel sets to point at its syscall entry point). And it doesn't use TSS stuff to set RSP to the kernel stack pointer; the kernel has to do that itself. (Usually using swapgs to get access to per-core or per-task storage where it can save the user-space RSP and load a kernel stack pointer. In Linux, the kernel GS base points to a per-CPU data area, from which the entry code loads the current task's kernel stack pointer.)
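From the user-space side, the register usage described above shows up in the clobber list of a hand-written system call. This is a minimal sketch assuming GCC or Clang on x86-64 Linux; `raw_getpid` is a name made up for this example, and on other targets it simply falls back to glibc's generic `syscall()` wrapper.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: getpid via a bare syscall instruction. */
long raw_getpid(void)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)               /* return value in RAX */
                     : "0"((long)SYS_getpid)   /* syscall number in RAX */
                     /* The CPU overwrites RCX with the return RIP and
                      * R11 with the saved RFLAGS, so tell the compiler. */
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Not x86-64 Linux: let libc pick the entry mechanism. */
    return syscall(SYS_getpid);
#endif
}
```

The result should match `getpid()`. The point of the sketch is the clobber list: RCX and R11 are lost across the instruction precisely because syscall stores the return state in registers instead of pushing it on a stack.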

sysenter uses a different mechanism, but I think the idea is similar: the kernel entry address comes from an MSR, instead of having to be loaded from the IDT every time with all the machinery of parsing an IDT entry type.

The syscall and sysenter entry points are a bit like interrupt handlers, but an iret wouldn't get you back to user-space. (Instead, sysret or sysexit would, given the state of registers / stack.)

Note that Intel's terminology differs from yours. Intel uses *exception* for events that are caused by an error condition and that can optionally push an error code. The term *interrupt* instead means a HW interrupt or an `int n` instruction. Exceptions can be emulated with interrupts only if they don't push an error code. For example, `int3` specifically generates an exception, but since this exception has no error code, it can be emulated by (and is totally equivalent to) an ordinary `int 3`. — Margaret Bloom, Apr 21 '21 at 10:02
For an interrupt, RIP will always point to the "next instruction" (you know well the concept of next can be quite difficult to define for HW interrupts, let's not spend time on it); for an exception it depends on the type. A fault will set RIP to the faulting instruction, a trap to the next instruction (`int3` is a trap for example, otherwise the debugger will loop without adjusting RIP). — Margaret Bloom, Apr 21 '21 at 10:02
@MargaretBloom: Thanks for the terminology reminder of exactly what Intel means with their terminology. Updated to avoid appearing to give a definition of "exception"; I think that was the only problem you were pointing out, and the rest of your comments are a nice footnote. — Peter Cordes, Apr 21 '21 at 18:55

user123 · Answer 3 · 2021-04-21T00:25:27.450

In 32-bit x86 Linux, the sysenter instruction is used. The sysenter instruction isn't an interrupt. It jumps to the address specified in an MSR (which Linux set at boot).

In x64 Linux, the syscall instruction is used instead. It works in much the same way as sysenter.

Have a look at the following Q&A on Stack Overflow: Who sets the RIP register when you call the clone syscall?. I provided an answer there which is quite complete.

Also, what I didn't mention is that when you link a program statically, all the glibc code, up to and including the syscall instruction, is added to your executable. Your code thus relies on the presence of the OS to run (because otherwise there isn't anything to jump to).

The answer is thus: no, system calls aren't executed in an interrupt handler.
