Calling system calls from the kernel code

Question

I am trying to create a mechanism to read performance counters for processes. I want this mechanism to be executed from within the kernel (version 4.19.2) itself.

I am able to do it from user space using the sys_perf_event_open() system call, as follows.

syscall (__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
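For context, a fuller user-space version of that call might look like the minimal sketch below, modeled on the example in the perf_event_open(2) man page. The helper name open_instruction_counter and the chosen event (retired instructions, user mode only) are illustrative assumptions, not part of the original code.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc ships no wrapper for perf_event_open, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

/*
 * Hypothetical helper: open a (initially disabled) counter of retired
 * user-mode instructions for `pid` on any CPU (pid 0 means the caller).
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_instruction_counter(pid_t pid)
{
    struct perf_event_attr pe;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;        /* start stopped; enable explicitly later */
    pe.exclude_kernel = 1;  /* do not count kernel-mode instructions */
    pe.exclude_hv = 1;      /* do not count hypervisor instructions */

    return (int)perf_event_open(&pe, pid, -1 /* any cpu */,
                                -1 /* no group */, 0);
}
```

With a valid descriptor, the counter would then typically be started with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) and read with read(fd, &count, sizeof(count)), where count is a 64-bit integer. The call can fail with -1 on systems where perf events are restricted or unavailable.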

I would like to invoke this call from kernel space. I got the basic idea from this question: How do I use a Linux System call from a Linux Kernel Module

Here are the steps I took to achieve this:

To make sure that kernel virtual addresses are accepted as valid, I have used set_fs(), get_fs() and get_ds().
Since sys_perf_event_open() is declared in include/linux/syscalls.h, I have included that header in the code.

Eventually, the code for calling the system call looks something like this:

mm_segment_t fs;
fs = get_fs();
set_fs(get_ds());
long ret =  sys_perf_event_open(&pe, pid, cpu, group_fd, flags);
set_fs(fs);

Even after these measures, I get an error claiming "implicit declaration of function ‘sys_perf_event_open’ ". Why is this popping up when the header file declaring it is already included? Does it have something to do with the way one should call system calls from within the kernel code?

Mostly (based on this question and your previous questions) you need to spend a lot more time reading and understanding the existing code (including reading and understanding the low-level facilities different CPUs provide that the existing Linux code is built on top of); so that you can modify the existing code so that it either does what you want or provides functionality that other code (e.g. a kernel module) can use to do what you want. — Brendan, Mar 15 '19 at 13:55

score 3 · Answer 1 · answered Mar 15 '19 at 13:40

In general (not specific to Linux) the work done for system calls can be split into 3 categories:

switching from user context to kernel context (and back again on the return path). This includes things like changing the processor's privilege level, messing with gs, fiddling with stacks, and doing security mitigations (e.g. for Meltdown). These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
using a "function number" parameter to find the right function to call, and calling it. This typically includes some sanity checks (does the function exist?) and a table lookup, plus code to mangle input and output parameters, which is needed because the calling convention used for system calls (from user space) is not the same as the calling convention that normal C functions use. These things are expensive, and if you're already in the kernel they're useless and/or dangerous.
the final normal C function that ends up being called. This is the function that you might have (see note) been able to call directly without using any of the expensive, useless and/or dangerous system call junk.

Note: If you aren't able to call the final normal C function directly without using (any part of) the system call junk (e.g. if the final normal C function isn't exposed to other kernel code); then you must determine why. For example, maybe it's not exposed because it alters user-space state, and calling it from kernel will corrupt user-space state, so it's not exposed/exported to other kernel code so that nobody accidentally breaks everything. For another example, maybe there's no reason why it's not exposed to other kernel code and you can just modify its source code so that it is exposed/exported.
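To make the three categories concrete, here is a toy user-space model (not kernel code). Every name in it (toy_regs, toy_table, toy_syscall_entry, do_add) is made up for illustration, and the mode switch is reduced to a counter. It only sketches the shape of the path: entry, then lookup plus argument unpacking, then the plain C function.

```c
#include <errno.h>
#include <stddef.h>

/* Toy "register file": the function number plus two arguments. */
struct toy_regs {
    long nr;
    long arg0, arg1;
};

/* Category 3: the ordinary C function that does the real work. */
static long do_add(long a, long b)
{
    return a + b;
}

/* Category 2 (argument mangling): unpack registers into C arguments. */
static long stub_add(const struct toy_regs *regs)
{
    return do_add(regs->arg0, regs->arg1);
}

typedef long (*toy_syscall_fn)(const struct toy_regs *);

enum { TOY_NR_ADD = 0, TOY_NR_MAX };

/* Category 2 (table lookup): function number -> stub. */
static const toy_syscall_fn toy_table[TOY_NR_MAX] = {
    [TOY_NR_ADD] = stub_add,
};

/* Stand-in for the privilege/stack switching of category 1. */
static int toy_mode_switches;

/* Categories 1 and 2 together: what a caller in "user space" goes through. */
static long toy_syscall_entry(const struct toy_regs *regs)
{
    long ret;

    toy_mode_switches++;                    /* "user -> kernel" */
    if (regs->nr < 0 || regs->nr >= TOY_NR_MAX || !toy_table[regs->nr])
        ret = -ENOSYS;                      /* sanity check failed */
    else
        ret = toy_table[regs->nr](regs);    /* dispatch through the table */
    toy_mode_switches++;                    /* "kernel -> user" */
    return ret;
}
```

Code that is already "inside" can call do_add(2, 3) directly and skip both the entry bookkeeping and the table lookup, which is the point of the third category.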

score 1 · Answer 2 · answered Jun 03 '19 at 13:05

Calling system calls from inside the kernel using the sys_* interface is discouraged for the reasons that others have already mentioned. In the particular case of x86_64 (which I guess is your architecture), starting from kernel v4.17 it is a hard requirement not to use this interface (with a few exceptions). It was possible to invoke system calls directly before that version, which is why there are plenty of tutorials on the web using sys_*, but now the error you are seeing pops up. The alternative proposed in the Linux documentation is to define a wrapper between the syscall and the actual syscall's code, so that the code can be called within the kernel like any other function:

int perf_event_open_wrapper(...) {
    // actual perf_event_open() code
}

SYSCALL_DEFINE5(perf_event_open, ...) {
    return perf_event_open_wrapper(...);
}

source: https://www.kernel.org/doc/html/v4.19/process/adding-syscalls.html#do-not-call-system-calls-in-the-kernel
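A toy user-space model of that split, to show where the boundary sits. This is a sketch under assumptions, not the real perf code: toy_attr, toy_copy_from_user and both functions are invented names, the copy is a plain memcpy, and the returned descriptor is a fixed placeholder. The idea is that the syscall entry point deals with the user-supplied pointer, while the wrapper only ever sees kernel-side data, so an in-kernel caller would not need the set_fs() trick.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Invented, cut-down stand-in for struct perf_event_attr. */
struct toy_attr {
    unsigned int type;
    unsigned long long config;
};

/* Stand-in for copy_from_user(): here only a NULL check plus memcpy. */
static int toy_copy_from_user(void *dst, const void *user_src, size_t n)
{
    if (user_src == NULL)
        return -EFAULT;
    memcpy(dst, user_src, n);
    return 0;
}

/* The wrapper: takes kernel-side arguments, callable from other kernel code. */
static long toy_perf_event_open_wrapper(const struct toy_attr *attr,
                                        int pid, int cpu)
{
    if (pid == -1 && cpu == -1)
        return -EINVAL;   /* mirrors the documented "invalid" combination */
    if (attr->type > 1)
        return -EINVAL;   /* arbitrary toy validation */
    return 3;             /* placeholder for a new file descriptor */
}

/* What the body of SYSCALL_DEFINE5(perf_event_open, ...) would reduce to. */
static long toy_sys_perf_event_open(const struct toy_attr *user_attr,
                                    int pid, int cpu)
{
    struct toy_attr attr;
    int err = toy_copy_from_user(&attr, user_attr, sizeof(attr));

    if (err)
        return err;
    return toy_perf_event_open_wrapper(&attr, pid, cpu);
}
```

In this model a kernel-side caller would build a toy_attr on its own stack and call toy_perf_event_open_wrapper() directly, never touching the syscall entry point.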

Lair · Answer 3 · 2019-03-15T10:44:11.763

0

Which kernel version are we talking about?

Anyhow, you could get the address of sys_call_table either by looking at the System.map file or, if it is exported, by looking up the symbol (have a look at kallsyms.h). Once you have the address of the syscall table, you may treat it as a void pointer array (void **) and find your desired functions by index, i.e. sys_call_table[__NR_open] would be open's address, so you could store it in a void pointer and then call it.

Edit: What are you trying to do, and why can't you do it without calling syscalls? You must understand that syscalls are the kernel's API to userland and should not really be used from inside the kernel, so this practice should be avoided.

edited Mar 15 '19 at 10:44

answered Mar 15 '19 at 00:45


I am using kernel 4.19.2; no, this call is not exported. I need the mechanism to be in the kernel itself. @Book Of Zeus I am trying to get the performance counters for every process and store them in its task structure. I can't figure out a way to do this without a syscall. – Nikhilesh Singh Mar 15 '19 at 06:10
If I understand you correctly, I think you might find kprobes handy; have a look at them. Besides, if the syscall table is not exported, I'd either have a look at the system's map file, or kprobe a function within the syscall (and then determine whether it was called from a syscall; the address should be on the stack, which is available to you), and then do my thing. For example, syscall `kill` would invoke a call to `kill_something_info` – Lair Mar 15 '19 at 10:43

Basile Starynkevitch · Answer 4 · 2019-06-03T14:05:28.893

calling system calls from kernel code

(I am mostly answering that title; to summarize: it is forbidden to even think of that)

I don't understand your actual problem (I feel you need to explain it more in your question, which is unclear and lacks a lot of useful motivation and context). But general advice, following the Unix philosophy, is to minimize the size and vulnerability area of your kernel or kernel module code, and to move as much of that code as is convenient into user-land, in particular with the help of systemd, as soon as your kernel code requires some system calls. Your question is by itself a violation of most Unix and Linux cultural norms.

Have you considered using efficient kernel to user-land communication, in particular netlink(7) with socket(7)? Perhaps you also want some driver-specific kernel thread.

My intuition would be that (in some user-land daemon started from systemd early at boot time) AF_NETLINK with socket(2) is an exact fit for your (unexplained) needs. And eventfd(2) might also be relevant.
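As a rough illustration of the user-land side, a daemon could open and bind a netlink socket along these lines. This is a minimal sketch: the helper name open_netlink_socket is hypothetical, and which netlink protocol to pass is left to the caller, since the right choice depends on what the kernel side registers.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a raw netlink socket for `protocol`
 * (for example NETLINK_USERSOCK) and bind it.
 * Returns the file descriptor, or -1 with errno set.
 */
static int open_netlink_socket(int protocol)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, protocol);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = 0;     /* 0 lets the kernel assign a unique port ID */
    addr.nl_groups = 0;  /* unicast only, no multicast groups */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Messages would then be exchanged with sendmsg(2) and recvmsg(2), each payload prefixed by a struct nlmsghdr as described in netlink(7).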

But just thinking of using system calls from inside the kernel triggers a huge flashing red light in my brain and I tend to believe it is a symptom of a major misunderstanding of operating system kernels in general. Please take time to read Operating Systems: Three Easy Pieces to understand OS philosophy.
