What security issues arise in the absence of a kernel stack?

Question

What are the kinds of security issues that can result from kernel code using the regular stack of the process?

User mode code can load any value into e/rsp. It doesn’t need to be a valid address. — prl, Oct 05 '19 at 21:09

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

Trivial for unprivileged userspace to crash the kernel, and pretty easy to take it over or "just" gain root.

User-space control of the kernel stack pointer (used asynchronously by interrupts) destroys any possibility of "unprivileged" code, assuming said unprivileged code is machine-code controlled by a potential attacker. Or just bugs in user-space can then crash the kernel.

Crashing the kernel could be as simple as xor esp,esp / int 0x80 or waiting for a timer interrupt. That probably leads to a page-fault from trying to push the exception frame on an unmapped page, after RSP wraps to 0xFF...8. (The kernel uses the same page-tables as user-space; there's a bit in each PTE that marks it as kernel-only or not.) Failure in trying to deliver that page-fault leads to another page fault or GPF, and boom you've triple-faulted.
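To make the wrap-around concrete, here is a minimal sketch of the address arithmetic only. The helper name push_target is made up for illustration; it just models "decrement the stack pointer, then store" with 64-bit unsigned wrap, which is how a zeroed RSP ends up at 0xFF...8.

```c
#include <stdint.h>

/* Address of the lowest byte written by a push of `size` bytes.
 * The stack pointer is decremented first; unsigned arithmetic
 * wraps modulo 2^64, as the hardware's address calculation does. */
uint64_t push_target(uint64_t rsp, uint64_t size)
{
    return rsp - size;
}
```

Calling push_target(0, 8) models the first 8-byte push after xor esp,esp (which zero-extends into RSP in 64-bit mode).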

Control of RSP also lets you trivially overwrite arbitrary kernel addresses with an exception frame, potentially affecting what happens on other cores.
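As a rough illustration of that overwrite, the following toy model treats memory as an array of qwords and pushes a 64-bit-mode interrupt frame (SS, RSP, RFLAGS, CS, RIP) at whatever stack index it is given. Everything here (MEM_WORDS, struct frame, deliver_interrupt) is a hypothetical simulation, not real kernel or CPU code; it assumes a CPU that does not switch stacks on interrupt delivery.

```c
#include <stdint.h>

#define MEM_WORDS 64   /* toy memory: 64 qwords, index = address / 8 */

struct frame { uint64_t ss, rsp, rflags, cs, rip; };

/* Hypothetical CPU that pushes the interrupt frame at the *current*
 * stack pointer instead of loading one from the TSS.
 * Returns the new stack index, or -1 if the frame would not fit
 * (standing in for a fault in this toy model). */
long deliver_interrupt(uint64_t mem[MEM_WORDS], long sp, struct frame f)
{
    if (sp < 5 || sp > MEM_WORDS)
        return -1;
    mem[--sp] = f.ss;
    mem[--sp] = f.rsp;
    mem[--sp] = f.rflags;
    mem[--sp] = f.cs;
    mem[--sp] = f.rip;
    return sp;
}
```

Whatever lived in the five slots below sp is replaced by frame contents, so choosing sp chooses which "kernel" data gets clobbered.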

Notice that I used int 0x80 instead of syscall because syscall jumps to the entry-point stored in an MSR without touching memory (or modifying RSP). The kernel could in theory check it was valid before doing anything in that case. But true interrupts (including software interrupts) push CS:RIP and RFLAGS before any kernel instructions run. On actual x86-64, interrupts use a kernel-RSP value from the TSS. If this didn't happen, user-space would control the virtual address for those stores. (IDK if it's even possible to configure things to use the user-space RSP unmodified, or if HW effectively enforces having a kernel stack / per-task kernel stacks.)

(Normally the kernel's syscall entry point uses swapgs and a load from gs:0 or something to load the kernel stack pointer from the bottom of the kernel stack.)

Taking over the kernel (local privilege escalation):

Start multiple threads in one process, so they all share the same virtual address space. (Or use POSIX shared memory or whatever and set RSP there).
One thread stores its stack pointer to a global where other threads can read it.
That thread makes a system call; the kernel uses its stack for kernel-space return addresses and data. Choose one like open() or stat() that will take some time for the sys_open() or sys_stat() kernel functions to return, especially if they block on disk I/O during pathname resolution or accessing an inode.

Or even more simply, nanosleep. (A system call that sleeps leaves user-space state saved on the kernel stack where it will eventually ret back to after context-switching back to this task and returning from the call to schedule().) Blocking on disk I/O is unnecessarily complicated. Although it does expose lots of filesystem code as possible sources of register values; you can choose which return address you want to overwrite.
While that's happening, another user-space thread modifies that memory, gaining control of the kernel's RIP / EIP and data on the stack. Even with non-executable kernel stacks, there's a lot you can do. By reading the return addresses you can defeat kernel ASLR, and then know how to modify them to jump to whatever kernel code you want.
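The steps above can be sketched as a single-threaded simulation, with each "thread" represented by a function called in sequence. This is only a model of the ordering, under the assumption that the blocked system call's saved return target lives in memory another thread can write; all names (shared_stack, syscall_block, other_thread_runs, syscall_resume) are invented for the example.

```c
typedef int (*handler_t)(void);

int normal_return(void)   { return 0; }  /* where the "kernel" meant to return */
int attacker_target(void) { return 1; }  /* existing "kernel" code the attacker prefers */

/* Memory that both the sleeping system call and another thread can write. */
struct shared_stack { handler_t saved_return; };

/* Step 1: the system call saves its return target, then blocks. */
void syscall_block(struct shared_stack *s)
{
    s->saved_return = normal_return;
}

/* Step 2: a second thread gets scheduled while the first is blocked. */
void other_thread_runs(struct shared_stack *s)
{
    s->saved_return = attacker_target;
}

/* Step 3: the system call wakes up and "returns" through the saved slot. */
int syscall_resume(const struct shared_stack *s)
{
    return s->saved_return();
}
```

Without step 2, syscall_resume() ends up in normal_return(); with it, control goes wherever the second thread pointed the slot. Note the target is still existing code, which is the ROP-style situation rather than code injection.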

The kernel uses the same page-table as user-space, so read/write/exec permissions can be set with mprotect(PROT_EXEC) before making a system call. Executable stack pages would make code-injection trivial. But the SMEP bit (Supervisor Mode Execution Prevention), introduced with Ivy Bridge, blocks this, disallowing ring 0 exec of user-space pages (those with the U/S bit set in the page-table entry, which will always be the case for any page "owned" by user-space). Another more recent blog post.
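A simplified model of the SMEP check might look like the following. It looks at a single page-table entry only (real hardware combines the U/S bits from every paging level), and it assumes EFER.NXE is enabled so the NX bit is honoured. The function name fetch_allowed is made up; the bit positions are the architectural ones for x86-64 PTEs and CR4.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_USER    (1ULL << 2)    /* U/S: set for user-accessible pages */
#define PTE_NX      (1ULL << 63)   /* no-execute */
#define CR4_SMEP    (1ULL << 20)

/* Simplified: may an instruction fetch from this page proceed? */
bool fetch_allowed(uint64_t pte, int cpl, uint64_t cr4)
{
    if (!(pte & PTE_PRESENT))
        return false;
    if (pte & PTE_NX)              /* assumes EFER.NXE = 1 */
        return false;
    if (cpl == 3)                  /* user mode needs U/S = 1 */
        return (pte & PTE_USER) != 0;
    /* Supervisor mode: SMEP forbids fetching from user-accessible pages. */
    if ((cr4 & CR4_SMEP) && (pte & PTE_USER))
        return false;
    return true;
}
```

So a user page made executable with mprotect is fetchable from ring 0 only while CR4.SMEP is clear.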

You could still just ret to somewhere after permission checks in the create_module(2) system-call handler to get a module loaded from the filesystem, containing your code which runs in kernel space. The attack surface for ROP attacks is huge because the kernel has the implementation of every system call, including privileged ones. Not to mention various internal functions that system calls and other things use, and tons of driver code.

Broadwell introduced another feature, SMAP (Supervisor Mode Access Prevention), which would defend against this. While active, the kernel will fault if it tries to even read a user page. It has to be disabled around copy_to_user() and copy_from_user(), but reaching those functions with RSP pointing to user-space memory seems unlikely: call would fault on pushing a return address. Possibly on a 32-bit kernel you might be able to make a system call with ESP just above the 3:1 user/kernel split so only some nested function call would cross from the bottom of the 1G kernel page into the highest user page. But if copy_to/from_user are leaf functions (or don't make any function calls while SMAP is disabled) we probably can't attack them.
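In the same spirit, here is a rough sketch of the SMAP rule for explicit supervisor-mode data accesses. It deliberately ignores R/W, CR0.WP and multi-level U/S combination, and the function name is hypothetical. The EFLAGS.AC override is what the stac / clac instructions toggle, which is how the check gets lifted around user-copy routines.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_USER    (1ULL << 2)    /* U/S: set for user-accessible pages */
#define CR4_SMAP    (1ULL << 21)
#define EFLAGS_AC   (1ULL << 18)   /* set by stac, cleared by clac */

/* Simplified: may ring-0 code read or write data in this page? */
bool supervisor_access_allowed(uint64_t pte, uint64_t cr4, uint64_t rflags)
{
    if (!(pte & PTE_PRESENT))
        return false;
    if (!(pte & PTE_USER))         /* kernel page: SMAP is irrelevant */
        return true;
    if (!(cr4 & CR4_SMAP))         /* SMAP off: user pages are fair game */
        return true;
    return (rflags & EFLAGS_AC) != 0;
}
```

With RSP pointing at a user page and AC clear, the very first push or call in kernel mode fails this check, which is the fault described above.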

Crashing the kernel would still be trivial with SMAP, but it makes non-DoS exploits harder. (That's its purpose on real x86-64: turning possible exploits into faults.) Still, in our hypothetical x86 without kernel stacks, setting RSP to a kernel address and making a system call (and not using [RSP] in user-space) will allow overwrites of kernel data by kernel instructions, which SMAP doesn't stop. See below re: without multitasking.

Or if you don't actually want to run code in kernel mode, you could just ret to code that elevates your process to root, setting EUID = 0.

You can control values in registers when a ret is reached by choosing which system call you make and what args you pass. And which level of nested function calls to overwrite the return address in.

Notice that a blocking system call makes this attack possible even on a single-core machine, where the attacking thread can't run simultaneously with kernel code. It just has to get scheduled before the victim system call returns, and that's what blocking makes possible.

On a toy system without multitasking (no way into the kernel that ever runs any other user-space code before returning to the place it left), "all" you could do is overwrite arbitrary kernel memory addresses with stack frames. That includes invalid system calls (like an RAX value that returns -ENOSYS on Linux), which dump user-space register contents in a known pattern and then return without disturbing much more stack space. This assumes a syscall entry point written something like Linux's, which checks the call number pretty early, without a bunch of call/ret that would scribble garbage where you might not want it if you want to take over instead of just crash.

As soon as your invalid syscall returns, you restore RSP to a sane value and then make a system call that exploits whatever data you just overwrote, e.g. letting a system call succeed that normally wouldn't: chmod + chown to make an SUID-root executable, or if you managed to set your current task's UID to zero, then execve a new shell.
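The two-step idea (an invalid syscall as a write primitive, then a normal call that trusts the overwritten data) can be sketched as a toy simulation. Nothing here corresponds to real Linux structures: toy_kernel, UID_SLOT, the four-register file and toy_syscall are all assumptions made up for the example, and every call number is treated as invalid.

```c
#include <errno.h>
#include <stdint.h>

#define NREGS     4    /* toy register file; real entry code saves more */
#define MEM_WORDS 32
#define UID_SLOT  8    /* hypothetical location of the task's UID */

struct toy_kernel { uint64_t mem[MEM_WORDS]; };  /* "kernel memory" in qwords */

/* Toy syscall entry: save user registers at the caller-controlled stack
 * index, then reject the call number early without further pushes. */
long toy_syscall(struct toy_kernel *k, long sp, long nr,
                 const uint64_t regs[NREGS])
{
    (void)nr;                      /* every call number is "invalid" here */
    if (sp < NREGS || sp > MEM_WORDS)
        return -EFAULT;            /* stand-in for a crash in this model */
    for (int i = 0; i < NREGS; i++)
        k->mem[--sp] = regs[i];    /* attacker-chosen values land here */
    return -ENOSYS;
}

/* A later permission check that trusts the stored UID. */
int toy_is_root(const struct toy_kernel *k)
{
    return k->mem[UID_SLOT] == 0;
}
```

Pointing sp just above UID_SLOT plus the register count, with a zero in the last register pushed, makes the subsequent toy_is_root() check pass even though the "system call" itself failed with -ENOSYS.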

The SMEP flag in CR4 prevents execution of code in user-accessible pages while in ring 0. — prl, Oct 06 '19 at 11:31
@prl: ah thanks. So on a system using that, you'd need to rely on ROP attacks and kernel data overwrites. (Was already in the middle of an update adding more ideas about stuff, including a data overwrite that doesn't even require multitasking, just the syscall entry point pushing user-space registers, i.e. values controlled by the attacker.) — Peter Cordes, Oct 06 '19 at 11:56
