5

I'm looking for ways to learn which syscalls or which subsystems a process or thread spends time waiting in, i.e. blocked and not scheduled to run on a CPU.

Specifically if I have some unknown process, or a process where all we know is "it's slow" I'd like to be able to learn things like:

  • "it spends 80% of its time in sys_write() on fd 13 which is /some/file"
  • "it's spending a lot of time waiting to read() from a network socket"
  • "it's sleeping in epoll_wait() for activity on fds [4,5,6] which are [file /boo], [socket 10.1.1.:42], [notifyfd blah]"

In other words when my program is not running on the CPU what is it doing?

This is astonishingly hard to answer with perf because it does not appear to have any way to record the duration of a syscall from sys_enter to sys_exit or otherwise keep track of how long an event is. Presumably due to its sampling nature.

I'm aware of some experimental work with eBPF for Linux 4.6 and above that may help, with Brendan Gregg's off-cpu work. But in the sad world of operations and support a 4.6 kernel is a rare unicorn to be treasured.

What are the real world options?

Do ftrace, systemtap etc offer any insights here?

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778

1 Answers1

4

You can use strace. First, you might want to get a high-level summary of the costs of each type of system call. You can obtain this summary by running strace -c. For example, one possible output is the following:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.07    0.000263          26        10           getdents
  3.42    0.000010           0      1572           read
  3.42    0.000010           0       762         1 stat
  3.08    0.000009           0      1574         6 open
  0.00    0.000000           0        11           write
  0.00    0.000000           0      1569           close
  0.00    0.000000           0        48           fstat

The % time value is with respect to overall kernel time, not overall execution time (kernel+user). This summary tells you what the most expensive system calls are. However, if you need to determine which specific instances of system calls are most expensive and what arguments are passed to them, you can run strace -i -T. The -i option shows the instruction addresses of the instructions that performed the system call and the -T option the time spent in the system call. An output might look like this:

[00007f97f1b37367] open("myfile", O_RDONLY|O_CLOEXEC) = 3 <0.000020>
[00007f97f1b372f4] fstat(3, {st_mode=S_IFREG|0644, st_size=159776, ...}) = 0 <0.000018>
[00007f97f1b374ba] mmap(NULL, 159776, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f97f1d19000 <0.000019>
[00007f97f1b37467] close(3)             = 0 <0.000018>

The first column shows instruction addresses, the second column shows systems calls with their arguments, the third column shows the returned value, and the last column shows the time spent in that system call. This list is ordered by the dynamic occurrence of the system call. You can filter this output using either grep or the -e option. The instruction addresses can help you locate where in the source code these system calls are made. For example, if a long sequence of system calls have the same address, then there is a good chance that you have a loop somewhere in the code that contains the system call. If your executable binary is not PIE, the dynamic addresses are the same as the static addresses shown by objdump. But even with PIE, the relative order of the dynamic addresses is the same. I don't know if there is an easy way to map these system calls to source code lines.

If you want to find out things like "it spends 80% of its time in sys_write() on fd 13 which is /some/file" then you need to write a script that first extracts the return values of all open calls and the corresponding file name arguments and then sum up the times of all sys_write calls whose fd argument is equal to some value.

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95