How would I follow a system call from the trap into the kernel, through how arguments are passed, how the system call is located in the kernel, and how it is actually processed in the kernel, to the return back to the user and how state is restored?
3 Answers
SystemTap
This is the most powerful method I've found so far. It can even show the call arguments, see: "Does ftrace allow capture of system call arguments to the Linux kernel, or only function names?"
Usage:
sudo apt-get install systemtap
sudo stap -e 'probe syscall.mkdir { printf("%s[%d] -> %s(%s)\n", execname(), pid(), name, argstr) }'
Then on another terminal:
sudo rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b
Sample output:
mkdir[4590] -> mkdir("/tmp/a", 0777)
mkdir[4593] -> mkdir("/tmp/b", 0777)
Documentation: https://sourceware.org/systemtap/documentation.html
Seems to be kprobes based: https://sourceware.org/systemtap/archpaper.pdf
Tested on Ubuntu 18.04, Linux kernel 4.15.
ltrace -S shows both system calls and library calls
This awesome tool therefore gives even further visibility into what executables are doing.
Here, for example, I used it to analyze what system calls dlopen is making: https://unix.stackexchange.com/questions/226524/what-system-call-is-used-to-load-libraries-in-linux/462710#462710
ftrace
minimal runnable example
Mentioned at https://stackoverflow.com/a/29840482/895245 but here goes a minimal runnable example.
Run with sudo:
#!/bin/sh
set -eux
d=debug/tracing
mkdir -p debug
if ! mountpoint -q debug; then
mount -t debugfs nodev debug
fi
# Stop tracing.
echo 0 > "${d}/tracing_on"
# Clear previous traces.
echo > "${d}/trace"
# Find the tracer name.
cat "${d}/available_tracers"
# Disable tracing functions, show only system call events.
echo nop > "${d}/current_tracer"
# Find the event name.
grep mkdir "${d}/available_events"
# Enable tracing mkdir.
# Both statements below seem to do the exact same thing,
# just with different interfaces.
# https://www.kernel.org/doc/html/v4.18/trace/events.html
echo sys_enter_mkdir > "${d}/set_event"
# echo 1 > "${d}/events/syscalls/sys_enter_mkdir/enable"
# Start tracing.
echo 1 > "${d}/tracing_on"
# Generate two mkdir calls by two different processes.
rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b
# View the trace.
cat "${d}/trace"
# Stop tracing.
echo 0 > "${d}/tracing_on"
umount debug
Sample output:
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
mkdir-5619 [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)
mkdir-5620 [003] .... 10249.264613: sys_mkdir(pathname: 7ffcdc91ecb0, mode: 1ff)
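The mode: 1ff field in the trace above is printed in hexadecimal, whereas the SystemTap output earlier shows the same value as octal 0777. As a small sketch of how to read it, the snippet below extracts the mode from one trace line and converts it. The helper name mode_to_octal is hypothetical and the sample line is copied from the output above; it assumes the line ends with "mode: <hex>)".

```shell
#!/bin/sh
# Hypothetical helper (not part of ftrace): print the hex "mode:" field
# of one sys_mkdir trace line as an octal number.
mode_to_octal() {
    # Capture the hex digits between "mode: " and the closing parenthesis.
    hex=$(printf '%s\n' "$1" | sed -n 's/.*mode: \([0-9a-fA-F]*\)).*/\1/p')
    # printf accepts a 0x-prefixed argument and %o prints it in octal.
    printf '0%o\n' "0x${hex}"
}

# Sample line taken from the trace output above.
line='mkdir-5619 [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)'
mode_octal=$(mode_to_octal "$line")
echo "$mode_octal"   # prints 0777
```

The pathname field, by contrast, is only the user-space pointer value, so it cannot be decoded from the trace line alone.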
One cool thing about this method is that it shows the calls made by all processes on the system at once, although you can also filter PIDs of interest with set_ftrace_pid for the function tracers, or with set_event_pid for events such as the one used here.
Documentation at: https://www.kernel.org/doc/html/v4.18/trace/index.html
Tested on Ubuntu 18.04, Linux kernel 4.15.
GDB step debug the Linux kernel
Depending on the level of internals detail you need, this is an option: How to debug the Linux kernel with GDB and QEMU?
strace
minimal runnable example
Here is a minimal runnable example of strace with a freestanding hello world, which makes it perfectly clear how everything works: How should strace be used?
More info
https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance might be worth a read. It mentions:
perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso --show-nr-samples
and the BPF-based traceloop: https://github.com/kinvolk/traceloop which the article claims to be a very fast method:
sudo -E ./traceloop cgroups --dump-on-exit /sys/fs/cgroup/system.slice/sshd.service

It's actually relatively easy to use ftrace. Here's a classic article by Steven "Mr. ftrace" Rostedt. The second part is here.
There is a free video by Jan-Simon Möller of the Linux Foundation, and many other good introductory articles that you can find using search terms such as "ftrace tutorial" or "ftrace example".

You can use the -f and -ff options. Something like this:
strace -f -e trace=process bash -c 'ls; :'
-f Trace child processes as they are created by currently traced processes as a result of the fork(2) system call.
-ff If the -o filename option is in effect, each processes trace is written to filename.pid where pid is the numeric process id of each process. This is incompatible with -c, since no per-process counts are kept.
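As a rough sketch of the -ff behaviour described above, the script below combines it with -o so that each traced process gets its own file. The temporary directory and the "trace" file prefix are arbitrary choices for illustration, and the script simply skips the tracing step if strace is not installed.

```shell
#!/bin/sh
# Sketch: one trace file per process using -ff together with -o.
# The output directory and the "trace" prefix are illustrative choices.
outdir=$(mktemp -d)

if command -v strace >/dev/null 2>&1; then
    # -e trace=process keeps only process-management calls
    # (fork/clone, execve, exit, wait and similar).
    # The trailing ":" stops the shell from exec'ing ls directly,
    # so a child process is actually created.
    strace -ff -o "${outdir}/trace" -e trace=process \
        sh -c 'ls >/dev/null; :' || true
    # Expect files named trace.<pid>, one per traced process.
    ls "${outdir}"
else
    echo "strace not installed; skipping" >&2
fi
```

Reading the per-PID files separately avoids the interleaving you get when -f writes every process to a single stream.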

Note: "process" refers to the *kernel*'s notion of a process, which is usually called a "thread" in userland. – o11c Apr 24 '15 at 06:48