Description
I'm working on an embedded Linux system (using kernel 3.4 and bionic, Android-like) that is running on a multi-core ARMv7a SoC. We have a user-space thread that is basically servicing events coming out of the kernel. The events are generated from an IRQ and must be reacted to by user-space with very low latency.
A thread runs with SCHED_FIFO priority 0. It is the only priority 0 thread in the system. Approximate code for the thread:
while (1)
{
    struct pollfd fds[1];
    fds[0].fd = fd;
    fds[0].events = POLLIN|POLLRDNORM|POLLPRI;

    int ret = poll(fds, 1, reallyLongTimeout);
    FTRACE("poll() exit");
    if (ret > 0)
    {
        // notify worker threads of pending events
    }
}
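For context, here is a minimal sketch of the setup involved (not the real code; rt_thread_init(), ftrace_fd, the debugfs path, and the macro body are illustrative, and the sched_priority of 99 is an assumption -- it is the highest RT priority, which the scheduler tracepoints report as prio=0):

/* Minimal sketch of the assumed setup (not the real code). */
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

static int ftrace_fd = -1;

/* Assumed body of the FTRACE() macro: write a marker into the ftrace buffer. */
#define FTRACE(msg)                                         \
    do {                                                    \
        if (ftrace_fd >= 0)                                 \
            (void)write(ftrace_fd, (msg), strlen(msg));     \
    } while (0)

static void rt_thread_init(void)
{
    /* Usual debugfs mount point; may differ on the target. */
    ftrace_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);

    /* Highest RT priority; the sched tracepoints report this as prio=0. */
    struct sched_param sp = { .sched_priority = 99 };
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}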
Generally we get very good latency (the thread makes a full round trip back into poll() within the same millisecond the IRQ happened), but randomly we have delays into the tens of milliseconds that wreck everything. After tracing it all over the place I have come to the conclusion that the delay happens after the IRQ fires and before the poll() system call returns, because the thread puts itself to sleep. Then, some time later, it somehow gets woken up by some unknown force and everything continues on.
I suspected some other IRQ, but after enabling sched:*, irq:* and timer:* tracing I had to rule that out. I had some difficulty porting the syscalls:* tracers to the ARM kernel. The syscalls tracers work, but if I also enable sched:* I get all sorts of panics inside the ring_buffer code.
After inserting some custom trace points into sys_poll() I've come to the uncomfortable conclusion that my thread falls asleep after sys_poll() returned, but before it re-emerges in user space.
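The "custom:" entries below come from those dedicated trace points; for illustration only, something roughly equivalent could be done with trace_printk() inside do_poll() in fs/select.c (this is not the actual event definition, just the kind of annotation involved):

/* Illustration: a trace_printk()-based stand-in for the custom trace points,
 * placed in do_poll() right before the schedule step (count and timed_out
 * are do_poll()'s local variables in this kernel version). */
trace_printk("do_poll:%d - going to sleep, count = %d, timed_out = %d\n",
             __LINE__, count, timed_out);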
Here's an annotated trace with my custom trace points in fs/select.c:
<my thread>-915 [001] ...1 17.589394: custom: do_poll:786 - calling do_pollfd
<my thread>-915 [001] ...1 17.589399: custom: do_poll:794 - failed, no events
<my thread>-915 [001] ...1 17.589402: custom: do_poll:823 - going to sleep, count = 0, timed_out = 0
.... // everything going OK, then the IRQ happens, which runs a tasklet:
<random proc>-834 [000] d.h1 17.616541: irq_handler_entry: irq=17 name=hwblock handler=hw_block_process_irq
<random proc>-834 [000] d.h1 17.616569: softirq_raise: vec=6 [action=TASKLET]
<random proc>-834 [000] d.h1 17.616570: irq_handler_exit: irq=17 ret=handled
<random proc>-834 [000] ..s2 17.616627: softirq_entry: vec=6 [action=TASKLET]
.... // the tasklet signals the wait queue of the poll, which wakes up my thread:
<my thread>-915 [001] ...1 17.616827: custom: do_poll:826 - woke up, count = 0, timed_out = 0
<my thread>-915 [001] ...1 17.616833: custom: do_poll:772 - start of loop
<my thread>-915 [001] ...1 17.616840: custom: do_poll:786 - calling do_pollfd
<my thread>-915 [001] ...1 17.616852: custom: do_poll:788 - success, event!
<my thread>-915 [001] ...1 17.616859: custom: do_poll:810 - bailing, count = 1, timed_out = 0
<my thread>-915 [001] ...1 17.616862: custom: do_sys_poll:880 - before free_wait()
<my thread>-915 [001] ...1 17.616867: custom: do_sys_poll:882 - before __put_user()
<my thread>-915 [001] ...1 17.616872: custom: sys_poll:940 - do_sys_poll - exit
.... // the tasklet exits, and my thread appears to be about to return to user space
<random proc>-834 [000] .Ns2 17.616922: softirq_exit: vec=6 [action=TASKLET]
.... // wait wait, why is my thread going back to sleep, and what was it doing for 75us?
<my thread>-915 [001] d..3 17.616947: sched_stat_wait: comm=<another thread> pid=1165 delay=1010000 [ns]
<my thread>-915 [001] ...2 17.616957: sched_switch: prev_comm=<my thread> prev_pid=915 prev_prio=0 prev_state=S ==> next_comm=<another thread> next_pid=1165 next_prio=120
.... // everything running on for 20ms as if nothing is wrong, then my thread suddenly gets woken up.
.... // nothing pid 947 is doing should have any effect on <my thread>
<random proc>-947 [000] d..4 17.636087: sched_wakeup: comm=<my thread> pid=915 prio=0 success=1 target_cpu=001
<random proc>-1208 [001] ...2 17.636212: sched_switch: prev_comm=<random proc> prev_pid=1208 prev_prio=120 prev_state=R ==> next_comm=<my thread> next_pid=915 next_prio=0
<my thread>-915 [001] ...1 17.636713: tracing_mark_write: poll() exit
So somewhere my thread is becoming TASK_INTERRUPTIBLE and then voluntarily walking into the scheduler, and then... waking up for apparently no reason 20 ms later.
The occurrence of this seems at least somewhat timing dependent, and various attempts to observe it often make it harder to reproduce.
Questions
- Any ideas what's causing this?
- Any suggestions on easy ways of finding out where my thread is falling asleep?
- Any suggestions on easy ways of finding out why my thread wakes up?
- I've considered somehow adapting unwind_backtrace() to generate a single string I can stuff into each trace_sched_switch() call, but that seems a little daunting. Anything simpler along the same lines? (One possibility is sketched right after this list.)
- Any idea why tracing syscalls:* and sched:* together makes it explode with an unhandled page fault inside the ring buffer code where it needs to move the tail? It seems like it's dereferencing a user-space pointer (based on number similarity), but a different one each time.
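A sketch of one possibly simpler route for the backtrace question, relying on the generic stack-trace API rather than adapting unwind_backtrace() directly (requires CONFIG_STACKTRACE; the helper name, buffer handling, and entry count are illustrative only):

/* Sketch: render a task's kernel backtrace into a string so it can be
 * attached to a tracepoint. Requires CONFIG_STACKTRACE. */
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

static void backtrace_to_str(struct task_struct *tsk, char *buf, size_t len)
{
    unsigned long entries[16];
    struct stack_trace trace = {
        .entries     = entries,
        .max_entries = ARRAY_SIZE(entries),
        .skip        = 0,
    };
    size_t pos = 0;
    int i;

    save_stack_trace_tsk(tsk, &trace);        /* walk tsk's kernel stack */
    for (i = 0; i < trace.nr_entries && pos < len; i++)
        pos += scnprintf(buf + pos, len - pos, "%pS<-",
                         (void *)entries[i]);
}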
Things I have already tried and checked
- It's not a normal IRQ that runs too long or something with interrupts disabled; tracing with irq:* shows that. It might be some sort of TrustZone NMI, but somehow I doubt it.
- It shouldn't be RT throttling or timeslicing of any sort, because:
  a) sched_rt_runtime_us=10000 and sched_rt_period_us=10000, i.e. the runtime equals the period, so throttling is effectively disabled
  b) the thread has a fairly low duty cycle (<30 ms per second, spread over 60-80 events per second)
- It's probably not an artifact of having tracing enabled or of writing to /sys/kernel/debug/tracing/trace_marker from user space: it happens without that macro and with tracing disabled (even compiled out of the kernel). Also, the code related to that in trace.c and ring_buffer.c appears to be mostly lock-free.
- There's nothing else at priority 0, and the thread is not getting pre-empted; rather, it seems to willingly unschedule itself.
- I put a panic() at the top of syscall_trace() to make sure I'm not accidentally falling into one of the tracing/auditing paths on my way out of sys_poll(). It didn't fire, so that's not it.
Thank you very much in advance
Update #1
I gave up on finding something easy and implemented an unwind_backtrace_to_str() function that lets me instrument various tracepoints with backtrace information. After adding backtraces to trace_sched_switch() and trace_sched_wake() I managed to isolate several causes of latency, with the main two being:
- Priority inversion due to mm->mmap_sem being taken by some other thread in the same process doing a fork()/mmap()/munmap(), and therefore not being available during futex_wait() or tracing_mark_write() for the RT thread. This one can mostly be avoided by restructuring some code and by using vfork() instead of fork() in some places.
- Failure to run the scheduled task when sched_wake() is called from a different source CPU than the CPU the task needs to run on. This seems to be the bigger problem. I traced it through the scheduler, and it seems that in the bad case wake_up_process() calls try_to_wake_up(), which winds up calling ttwu_queue(), which is where things get interesting.
Inside ttwu_queue() we do not enter the 'if' branch, because cpus_share_cache() always returns true for any of our cores (that sounds right, shared L2). That means it simply calls ttwu_do_activate() for the task and exits. ttwu_do_activate() seems to only put the task on the right run queue and mark it as TASK_RUNNING, but does not have any SMP handling.
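For reference, the path described above looks roughly like this in 3.4's kernel/sched/core.c (quoted from memory, so details may be slightly off):

/* Approximate 3.4 logic; only the remote path involves queueing on the
 * other CPU, and it is skipped when the CPUs share a cache. */
static void ttwu_queue(struct task_struct *p, int cpu)
{
    struct rq *rq = cpu_rq(cpu);

#if defined(CONFIG_SMP)
    if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
        sched_clock_cpu(cpu);   /* sync clocks across CPUs */
        ttwu_queue_remote(p, cpu);
        return;
    }
#endif

    raw_spin_lock(&rq->lock);
    ttwu_do_activate(rq, p, 0); /* activate on cpu's runqueue; see above */
    raw_spin_unlock(&rq->lock);
}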
I added the following after p->state = TASK_RUNNING; in ttwu_do_wakeup():
#ifdef CONFIG_SMP
    if (task_cpu(p) != smp_processor_id())
        smp_send_reschedule(task_cpu(p));
#endif
and it fixes the problem by forcing the target CPU to run the scheduler. However, I suspect this isn't how it's supposed to work, and even if this is a real bug there's probably a more refined fix. I checked the latest kernel (3.14) and the code in core.c looks pretty much the same.
Any ideas why this is? Why does it not call ttwu_queue_remote() when cpus_share_cache() returns true? So what if they share cache -- I can see how that's relevant for a migration decision, but does it matter for whether the wakeup is done locally or remotely? Maybe our cpus_share_cache() should be returning false? That function doesn't seem to be well documented (or maybe I'm just not looking in the right places).
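For what it's worth, the 3.4 implementation of cpus_share_cache() (again from memory, so possibly slightly off) just compares the CPUs' last-level-cache scheduler-domain IDs, which would indeed match for cores that share L2:

/* Approximate 3.4 kernel/sched/core.c: CPUs "share cache" when they fall
 * into the same last-level-cache scheduling domain (sd_llc_id). */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
    return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}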