How to access a process's kernel stack in linux kernel?

Question

I am trying to monitor which functions a process calls during its execution. My aim is to know how much time the process spends in each function. The functions are pushed over a stack and popped when function call returns. I would like to know where in the kernel code this push and pop actually happens.

I found a void *stack field in task_struct. I am not sure if this is the field I am looking for. If it is, then what is the way to know how it is updated?

I have to write a module that will make use of this code. Please help me in this case.

By functions, do you mean system calls? Or just any userland function that your code calls, also from userland? — mcleod_ideafix, Apr 19 '15 at 10:24
By a function call, I mean all possible functions that are executed as part of the process execution. For example, for a user-level file read operation, the system call read() is called, which is followed by kernel functions like do_page_fault(), do_generic_file_read() etc. — user3550605, Apr 19 '15 at 20:39
Since I can't comment as of this moment: have you tried reading the man page for /proc (http://man7.org/linux/man-pages/man5/proc.5.html) and running strace on the process? — macmania314, Apr 19 '15 at 10:28

myaut · Accepted Answer · 2015-04-19T22:00:43.953

The functions are pushed over a stack and popped when function call returns. I would like to know where in the kernel code this push and pop actually happens.

It doesn't happen in kernel code; it is done by the processor. That is, when an x86 CPU executes a call instruction, it pushes the return address (the IP of the next instruction) onto the stack, and the ret instruction pops that value.

You can patch every call and ret instruction in the kernel with call my_tracing_routine, record the instruction pointer there, and then pass control to the original callee/caller. There are tools for that: LTTng, SystemTap, and in-kernel interfaces such as kprobes and ftrace. This approach is called tracing.

But if you patch all instructions, e.g. with the SystemTap probe kernel.function("*"), you will kill performance and probably panic the system. So you can't measure every function call, but you can record what is executing once per sampling interval and hope that the samples are representative of the whole run. For that you need a large sample (i.e. run the program for a couple of minutes). That is called profiling.

Linux ships with a profiler, perf:

# perf record -- dd if=/dev/zero of=/dev/null
...
^C

# perf report
9.75%  dd  [kernel.kallsyms]  [k] __clear_user
6.69%  dd  [kernel.kallsyms]  [k] __audit_syscall_exit
5.61%  dd  [kernel.kallsyms]  [k] fsnotify
4.73%  dd  [kernel.kallsyms]  [k] system_call_after_swapgs
4.37%  dd  [kernel.kallsyms]  [k] system_call
...

You may also use -g to collect call chains. By default perf uses CPU performance counters: after N CPU cycles an interrupt is raised, and the perf handler (which is already built into the kernel) saves the IP.

If you wish to collect stacks, you may do that with SystemTap:

# stap --all-modules -e '
    probe timer.profile {
        if (execname() == "dd") {
            println("----");
            print_backtrace();
        }
    }' -c 'dd if=/dev/zero of=/dev/null'
...
    ----
0xffffffff813e714d : _raw_spin_unlock_irq+0x32/0x3c [kernel]
0xffffffff81047bb9 : spin_unlock_irq+0x9/0xb [kernel]
0xffffffff8104ac68 : get_signal_to_deliver+0x4f0/0x528 [kernel]
0xffffffff8100216f : do_signal+0x48/0x4b1 [kernel]
0xffffffff81002608 : do_notify_resume+0x30/0x63 [kernel]
0xffffffff813edd6a : int_signal+0x12/0x17 [kernel]

In this example SystemTap uses the timer.profile probe, which attaches to the perf event cpu-clock. To do so, it generates, builds and loads a kernel module. You may check that with stap -k -p 3

Thanks for the thorough reply! Can you please elaborate on the SystemTap output? To get output of the form above (a stack trace), where in the kernel code do the perf or SystemTap utilities make changes/updates? — user3550605, Apr 19 '15 at 20:43
