Description
I'm working on an embedded Linux system (using kernel 3.4 and bionic, Android-like) that is running on a multi-core ARMv7a SoC. We have a user-space thread that is basically servicing events coming out of the kernel. The events are generated from an IRQ and must be reacted to by user-space with very low latency.
A thread runs with SCHED_FIFO priority 0. It is the only priority 0 thread in the system. Approximate code for the thread:
while (1)
{
    struct pollfd fds[1];
    fds[0].fd = fd;
    fds[0].events = POLLIN|POLLRDNORM|POLLPRI;

    int ret = poll(fds, 1, reallyLongTimeout);
    FTRACE("poll() exit");
    if (ret > 0)
    {
        // notify worker threads of pending events
    }
}
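For context, here is a minimal sketch of the setup involved (not the real code; rt_thread_init(), ftrace_fd, the debugfs path, and the macro body are illustrative, and the sched_priority of 99 is an assumption -- it is the highest RT priority, which the scheduler tracepoints report as prio=0):

/* Minimal sketch of the assumed setup (not the real code). */
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

static int ftrace_fd = -1;

/* Assumed body of the FTRACE() macro: write a marker into the ftrace buffer. */
#define FTRACE(msg)                                         \
    do {                                                    \
        if (ftrace_fd >= 0)                                 \
            (void)write(ftrace_fd, (msg), strlen(msg));     \
    } while (0)

static void rt_thread_init(void)
{
    /* Usual debugfs mount point; may differ on the target. */
    ftrace_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);

    /* Highest RT priority; the sched tracepoints report this as prio=0. */
    struct sched_param sp = { .sched_priority = 99 };
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}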
Generally we get very good latency (the thread makes a full round trip back into poll() within the same millisecond the IRQ happened), but randomly we have delays into the tens of milliseconds that wreck everything. After tracing it all over the place I have come to the conclusion that the delay happens after the IRQ fires and before the poll() system call returns, because the thread puts itself to sleep. Then, some time later, it somehow gets woken up by some unknown force and everything continues on.
I suspected some other IRQ, but after enabling sched:*, irq:* and timer:* tracing I had to rule that out. I had some difficulty porting the syscalls:* tracers to the ARM kernel. The syscalls tracers work, but if I also enable sched:* I get all sorts of panics inside the ring_buffer code.
After inserting some custom trace points into sys_poll() I've come to the uncomfortable conclusion that my thread falls asleep after sys_poll() returned, but before it re-emerges in user space.
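The "custom:" entries below come from those dedicated trace points; for illustration only, something roughly equivalent could be done with trace_printk() inside do_poll() in fs/select.c (this is not the actual event definition, just the kind of annotation involved):

/* Illustration: a trace_printk()-based stand-in for the custom trace points,
 * placed in do_poll() right before the schedule step (count and timed_out
 * are do_poll()'s local variables in this kernel version). */
trace_printk("do_poll:%d - going to sleep, count = %d, timed_out = %d\n",
             __LINE__, count, timed_out);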
Here's an annotated trace with my custom trace points in fs/select.c:
<my thread>-915 [001] ...1 17.589394: custom: do_poll:786 - calling do_pollfd
<my thread>-915 [001] ...1 17.589399: custom: do_poll:794 - failed, no events
<my thread>-915 [001] ...1 17.589402: custom: do_poll:823 - going to sleep, count = 0, timed_out = 0
.... // everything going OK, then the IRQ happens, which runs a tasklet:
<random proc>-834 [000] d.h1 17.616541: irq_handler_entry: irq=17 name=hwblock handler=hw_block_process_irq
<random proc>-834 [000] d.h1 17.616569: softirq_raise: vec=6 [action=TASKLET]
<random proc>-834 [000] d.h1 17.616570: irq_handler_exit: irq=17 ret=handled
<random proc>-834 [000] ..s2 17.616627: softirq_entry: vec=6 [action=TASKLET]
.... // the tasklet signals the wait queue of the poll, which wakes up my thread:
<my thread>-915 [001] ...1 17.616827: custom: do_poll:826 - woke up, count = 0, timed_out = 0
<my thread>-915 [001] ...1 17.616833: custom: do_poll:772 - start of loop
<my thread>-915 [001] ...1 17.616840: custom: do_poll:786 - calling do_pollfd
<my thread>-915 [001] ...1 17.616852: custom: do_poll:788 - success, event!
<my thread>-915 [001] ...1 17.616859: custom: do_poll:810 - bailing, count = 1, timed_out = 0
<my thread>-915 [001] ...1 17.616862: custom: do_sys_poll:880 - before free_wait()
<my thread>-915 [001] ...1 17.616867: custom: do_sys_poll:882 - before __put_user()
<my thread>-915 [001] ...1 17.616872: custom: sys_poll:940 - do_sys_poll - exit
.... // the tasklet exits, and my thread appears to be about to return to user space
<random proc>-834 [000] .Ns2 17.616922: softirq_exit: vec=6 [action=TASKLET]
.... // wait wait, why is my thread going back to sleep, and what was it doing for 75us?
<my thread>-915 [001] d..3 17.616947: sched_stat_wait: comm=<another thread> pid=1165 delay=1010000 [ns]
<my thread>-915 [001] ...2 17.616957: sched_switch: prev_comm=<my thread> prev_pid=915 prev_prio=0 prev_state=S ==> next_comm=<another thread> next_pid=1165 next_prio=120
.... // everything running on for 20ms as if nothing is wrong, then my thread suddenly gets woken up.
.... // nothing pid 947 is doing should have any effect on <my thread>
<random proc>-947 [000] d..4 17.636087: sched_wakeup: comm=<my thread> pid=915 prio=0 success=1 target_cpu=001
<random proc>-1208 [001] ...2 17.636212: sched_switch: prev_comm=<random proc> prev_pid=1208 prev_prio=120 prev_state=R ==> next_comm=<my thread> next_pid=915 next_prio=0
<my thread>-915 [001] ...1 17.636713: tracing_mark_write: poll() exit
So somewhere my thread is becoming TASK_INTERRUPTIBLE and then voluntarily walking into the scheduler, and then... waking up for apparently no reason 20 ms later.
The occurrence of this seems at least somewhat timing dependent, and various attempts to observe it often make it harder to reproduce.
Questions
- Any ideas what's causing this?
- Any suggestions on easy ways of finding out where my thread is falling asleep?
- Any suggestions on easy ways of finding out why my thread wakes up?
- I've considered somehow adapting unwind_backtrace() to generate a single string I can stuff into each trace_sched_switch() call, but that seems a little daunting. Anything simpler along the same lines? (One possibility is sketched right after this list.)
- Any idea why tracing syscalls:* and sched:* together makes it explode with an unhandled page fault inside the ring buffer code where it needs to move the tail? It seems like it's dereferencing a user-space pointer (based on number similarity), but a different one each time.
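A sketch of one possibly simpler route for the backtrace question, relying on the generic stack-trace API rather than adapting unwind_backtrace() directly (requires CONFIG_STACKTRACE; the helper name, buffer handling, and entry count are illustrative only):

/* Sketch: render a task's kernel backtrace into a string so it can be
 * attached to a tracepoint. Requires CONFIG_STACKTRACE. */
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

static void backtrace_to_str(struct task_struct *tsk, char *buf, size_t len)
{
    unsigned long entries[16];
    struct stack_trace trace = {
        .entries     = entries,
        .max_entries = ARRAY_SIZE(entries),
        .skip        = 0,
    };
    size_t pos = 0;
    int i;

    save_stack_trace_tsk(tsk, &trace);        /* walk tsk's kernel stack */
    for (i = 0; i < trace.nr_entries && pos < len; i++)
        pos += scnprintf(buf + pos, len - pos, "%pS<-",
                         (void *)entries[i]);
}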
Things I have already tried and checked
- It's not a normal IRQ that runs too long or something with interrupts disabled; tracing with irq:* shows that. It might be some sort of TrustZone NMI, but somehow I doubt it.
- It shouldn't be RT throttling or timeslicing of any sort, because:
  a) sched_rt_runtime_us=10000 and sched_rt_period_us=10000, i.e. the runtime equals the period, so throttling is effectively disabled
  b) the thread has a fairly low duty cycle (<30 ms per second, spread over 60-80 events per second)
- It's probably not an artifact of having tracing enabled or of writing to /sys/kernel/debug/tracing/trace_marker from user space: it happens without that macro and with tracing disabled (even compiled out of the kernel). Also, the code related to that in trace.c and ring_buffer.c appears to be mostly lock-free.
- There's nothing else at priority 0, and the thread is not getting pre-empted; rather, it seems to willingly unschedule itself.
- I put a panic() at the top of syscall_trace() to make sure I'm not accidentally falling into one of the tracing/auditing paths on my way out of sys_poll(). It didn't fire, so that's not it.
Thank you very much in advance
Update #1
I gave up on finding something easy and implemented an unwind_backtrace_to_str() function that lets me instrument various tracepoints with backtrace information. After adding backtraces to trace_sched_switch() and trace_sched_wake() I managed to isolate several causes of latency, with the main two being:
- Priority inversion due to mm->mmap_sem being taken by some other thread in the same process doing a fork()/mmap()/munmap(), and therefore not being available during futex_wait() or tracing_mark_write() for the RT thread. This one can mostly be avoided by restructuring some code and by using vfork() instead of fork() in some places.
- Failure to run the scheduled task when sched_wake() is called from a different source CPU than the CPU the task needs to run on. This seems to be the bigger problem. I traced it through the scheduler, and it seems that in the bad case wake_up_process() calls try_to_wake_up(), which winds up calling ttwu_queue(), which is where things get interesting.
Inside ttwu_queue() we do not enter the 'if' branch, because cpus_share_cache() always returns true for any of our cores (that sounds right, shared L2). That means it simply calls ttwu_do_activate() for the task and exits. ttwu_do_activate() seems to only put the task on the right run queue and mark it as TASK_RUNNING, but does not have any SMP handling.
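For reference, the path described above looks roughly like this in 3.4's kernel/sched/core.c (quoted from memory, so details may be slightly off):

/* Approximate 3.4 logic; only the remote path involves queueing on the
 * other CPU, and it is skipped when the CPUs share a cache. */
static void ttwu_queue(struct task_struct *p, int cpu)
{
    struct rq *rq = cpu_rq(cpu);

#if defined(CONFIG_SMP)
    if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
        sched_clock_cpu(cpu);   /* sync clocks across CPUs */
        ttwu_queue_remote(p, cpu);
        return;
    }
#endif

    raw_spin_lock(&rq->lock);
    ttwu_do_activate(rq, p, 0); /* activate on cpu's runqueue; see above */
    raw_spin_unlock(&rq->lock);
}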
I added the following after p->state = TASK_RUNNING; in ttwu_do_wakeup():
#ifdef CONFIG_SMP
    if (task_cpu(p) != smp_processor_id())
        smp_send_reschedule(task_cpu(p));
#endif
and it fixes the problem by forcing the target CPU to run the scheduler. However, I suspect this isn't how it's supposed to work, and even if this is a real bug there's probably a more refined fix. I checked the latest kernel (3.14) and the code in core.c looks pretty much the same.
Any ideas why this is? Why does it not call ttwu_queue_remote() when cpus_share_cache() returns true? So what if they share cache -- I can see how that's relevant for a migration decision, but does it matter for whether the wakeup is done locally or remotely? Maybe our cpus_share_cache() should be returning false? That function doesn't seem to be well documented (or maybe I'm just not looking in the right places).
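For what it's worth, the 3.4 implementation of cpus_share_cache() (again from memory, so possibly slightly off) just compares the CPUs' last-level-cache scheduler-domain IDs, which would indeed match for cores that share L2:

/* Approximate 3.4 kernel/sched/core.c: CPUs "share cache" when they fall
 * into the same last-level-cache scheduling domain (sd_llc_id). */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
    return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}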