ARM Kernel Oops when interrupts are enabled in page fault handler or with preemptive scheduling

Question

Is it correct to enable interrupts in the page fault handler? Is there a known contention problem in the ARM kernel with preemptive scheduling?

I get an ARM kernel oops in the UDP receive code with CONFIG_PREEMPT, or when interrupts are enabled in the fault handler.

The problem is similar to what another user reported here. In my case, when I send UDP packets at 110% load (the system drops about 10% of the packets), the kernel oopses within a few minutes. This happens only when some busybox shell scripts are running, not when the UDP receiving program runs alone. I've traced the data addresses and they always look good: the buffer is allocated and used before it is freed.

There are two ways to avoid it:

[1] When I change the scheduling model from preempt (CONFIG_PREEMPT) to preempt_voluntary, the problem goes away. Is this a known issue with ARM on kernel 2.6.39? With preempt scheduling I also see a problem in jffs2 after a long while, but not with preempt_voluntary.

For a moment I suspected that the Ethernet DMA was saturating the bus, blocking the CPU from loading its TLB entries and so causing the page fault. I inferred this because busybox scripts need to be in the picture: when a script is spawned, it creates an address space and loads many TLB entries, which adds to the bus load. If preempt_voluntary is a solution, can DMA blocking the bus be ruled out?

The test system is a phy3250-based board running the LTIB lpclinux kernel 2.6.39.4.

[2] Some more tests showed that the page fault handler is nested by Ethernet interrupts. When I disable interrupts in the kernel page fault handler __dabt_svc, but keep them enabled in the user page fault handler __dabt_user, the problem goes away. Otherwise the nesting level goes up to 4 and the kernel oopses. So the question is: is enabling interrupts in the page fault handler correct?

The test code for [2] is below. Lines marked with @@@@ are added or modified. The nesting level is then captured in do_DataAbort().

file arch/arm/kernel/entry-armv.S:
__dabt_svc:
    svc_entry
... ...
    @
    @ set desired IRQ state, then call main handler
    @
    debug_entry r1
    @@@@Not_Enable_Irq_In_Dabtsvc
    ldr r2, =armv_dabtsvc_count @@@@
    ldr r3, [r2]    @@@@
    add r3, r3, #1  @@@@
    str r3, [r2]    @@@@
    msr cpsr_c, r9 @@@@disable this
    mov r2, r2 @@@@add this extra inst
    mov r2, sp
    bl  do_DataAbort

    @
    @ IRQs off again before pulling preserved data off the stack
    @
    disable_irq_notrace

    ldr r2, =armv_dabtsvc_count @@@@
    ldr r3, [r2]    @@@@
    sub r3, r3, #1  @@@@
    str r3, [r2]    @@@@
    @
    @ restore SPSR and restart the instruction
    @
    ldr r2, [sp, #S_PSR]
    svc_exit r2             @ return from exception
 UNWIND(.fnend      )
ENDPROC(__dabt_svc)

And add the variable to the file too:

file arch/arm/kernel/entry-armv.S:
@@@@save nesting level:
    .data            @@@@
    .align           @@@@
armv_dabtsvc_count:  @@@@
    .long   0   @ count svc entry    @@@@
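
The capture step in do_DataAbort() is not shown above. A rough user-space sketch of it follows, under stated assumptions: `armv_dabtsvc_count` stands in for the word defined in entry-armv.S, while `armv_dabtsvc_max`, `record_dabt_nesting()` and `simulate_dabt_svc()` are hypothetical names introduced only for illustration. It models the counter logic, not the real exception path.

```c
#include <assert.h>

/* Stand-in for the counter that the added assembly maintains. */
static volatile unsigned int armv_dabtsvc_count;
/* Hypothetical: deepest nesting level observed so far. */
static unsigned int armv_dabtsvc_max;

/* Hypothetical helper, meant to be called at the top of do_DataAbort(). */
static unsigned int record_dabt_nesting(void)
{
    unsigned int level = armv_dabtsvc_count;

    if (level > armv_dabtsvc_max)
        armv_dabtsvc_max = level;
    return level;
}

/* Model of one kernel-mode data abort that gets nested `depth` more times. */
static void simulate_dabt_svc(unsigned int depth)
{
    armv_dabtsvc_count++;               /* the added ldr/add/str on entry */
    record_dabt_nesting();
    if (depth > 0)
        simulate_dabt_svc(depth - 1);   /* another abort taken from an IRQ */
    assert(armv_dabtsvc_count > 0);
    armv_dabtsvc_count--;               /* the added ldr/sub/str on exit */
}
```

Calling `simulate_dabt_svc(3)` models the four-level nesting seen in the test; afterwards `armv_dabtsvc_max` holds the peak and the counter is back to zero.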

I'm trying to link all these up. Can kernel experts say whether these tests make sense? Is disabling interrupts in the page fault handler a valid solution?

Edit: The oops in the page fault handler is not the first failure. There was a "do_bad_area" in a preceding alignment handler call. That failed fixup of an unaligned access subsequently caused the page fault. Yes, as someone commented below, fixing unaligned accesses is very troublesome. The unaligned accesses come from ip_input, ip_fragment, and the UDP stack. Once I fixed all of them in the stack, the problem was gone.
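
For illustration, the kind of change that removes an unaligned access looks roughly like the sketch below. It is a user-space model, not the actual patch to the IP or UDP code: `read_be32_unaligned()` and `read_u32_unaligned()` are hypothetical helper names, and in-kernel code would normally use the kernel's `get_unaligned()` family instead.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a big-endian 32-bit field that may start at
 * any byte offset. Casting p to (uint32_t *) and dereferencing it is the
 * pattern that raises an alignment trap where unaligned word loads are
 * not handled in hardware.
 */
static uint32_t read_be32_unaligned(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Same idea for a native-endian field: memcpy has no alignment requirement. */
static uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

Both helpers only ever touch single bytes or go through memcpy, so the compiler is free to emit byte loads and the alignment handler never gets involved.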

Edit again: The problem is with two operations in the alignment handler: it fetches the instruction, and then fetches the data the instruction refers to. The oops is reported on the data access, but the cause is that the instruction fetch failed first with a page fault. Since the instruction being fetched is in kernel space, its page is always valid, which points to a silicon bug. If I change the code to fetch again, it succeeds, which makes a silicon bug more likely. Interrupts get into the picture because of the extra TLB flushing they bring in. In short, TLB loading is automatic, so fetching an instruction in kernel space should not fail, yet it did.
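
The "fetch again" experiment can be sketched as a small retry wrapper. This is a user-space model only: `fetch_fn`, `fetch_insn_retry()` and the `flaky_fetch()` test double are hypothetical names, and the real change sat inside the alignment handler's instruction fetch.

```c
/* Hypothetical fetch callback: returns 0 on success, nonzero on a fault. */
typedef int (*fetch_fn)(const unsigned long *addr, unsigned long *out);

/* Try the fetch up to max_tries times; return the attempt that worked, or -1. */
static int fetch_insn_retry(fetch_fn fetch, const unsigned long *addr,
                            unsigned long *out, int max_tries)
{
    int attempt;

    for (attempt = 1; attempt <= max_tries; attempt++) {
        if (fetch(addr, out) == 0)
            return attempt;
    }
    return -1;
}

/* Test double: fails a configurable number of times before succeeding. */
static int spurious_faults_left;

static int flaky_fetch(const unsigned long *addr, unsigned long *out)
{
    if (spurious_faults_left > 0) {
        spurious_faults_left--;
        return 1;               /* models the unexpected first fault */
    }
    *out = *addr;
    return 0;
}
```

A retry like this only hides the symptom. It is useful as a diagnostic, because a fetch from an always-mapped kernel page that fails once and then succeeds is hard to explain in software.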

You should contact the linux kernel mailing list about this. If it crashes then it's a bug. — Nico Erfurth, May 22 '13 at 07:52
Two other relevant SO QAs: [Page fault in interrupt context](http://stackoverflow.com/questions/4848457/page-fault-in-interrupt-context). [What happens when a mov instruction causes a page fault with interrupts disabled on x86](http://stackoverflow.com/questions/12607288/what-happens-when-a-mov-instruction-causes-a-page-fault-with-interrupts-disabled). — minghua, May 23 '13 at 01:09
No, you should not disable interrupts in the page fault handler. The page fault can happen in a kernel thread that is doing `copy_from_user()`. There is no reason to block interrupts in this case; you increase latency. Probably you are masking the problem with all of your suggestions; the kernel code is full of subtleties. — artless noise, May 23 '13 at 21:11
@artless: Great, I'm also a bit concerned. What do you think the problem is? I'm looking at do_alignment() in arch/arm/mm/. I think interrupts should be enabled only after calling __get_user(). Is this right? — minghua, May 24 '13 at 01:33
For alignment, see this [question](http://stackoverflow.com/questions/16548059/how-to-trap-unaligned-memory-access). The question would be: why is a driver/kernel module doing unaligned accesses? — artless noise, May 24 '13 at 03:30
@artless: Thanks for pointing to the alignment question. It is informative, but it is a separate question. In my case it is the IP stack, and in the reference kernel commits it is the IPv6 stack, that makes the unaligned accesses. There is a bug in the alignment handler: it messes up the address space when an interrupt gets in, and the next exception handler call then crashes. By the way, I believe copy_from_user() is not allowed to be called from a fault handler, as it is in interrupt context. Still, in general disabling interrupts is not a good idea. — minghua, May 24 '13 at 18:51
The interrupts are masked for `user` mode, to prevent a mix-up in the alignment trap flags. The page fault handlers should not call `copy_from_user()`; they are part of the implementation. `__get_user()` is different: it is trying to get the **code/instruction** that caused the alignment trap. The `switch` statement manually decodes the instruction. — artless noise, May 24 '13 at 19:26
It is a [*hail mary*](http://en.wikipedia.org/wiki/Hail_Mary_pass) operation to fix-up kernel accesses. It is much better to fix the drivers, etc. However, Masta's suggestion is good. The [ARM Linux mailing list](http://lists.infradead.org/mailman/listinfo/linux-arm-kernel) is a better place to ask. My guess is they will say the same thing and ask you to upgrade to the latest code. — artless noise, May 24 '13 at 19:34
There is a comment on top of probe_kernel_address() that reads "We ensure that the __get_user() is executed in atomic context". That's the cause of the oops. Busybox scripts also get segmentation faults in this test. Once __get_user() is replaced, both problems are fixed. I will check the jffs2 problem too; hopefully it is fixed as well. No, upgrading to the latest kernel is not an option. There are vendor patches from lpclinux, so we can only upgrade to the latest on that project. Actually yes, the Linux mailing list is a better place, but I was just scared by its FAQ. Thanks for helping me out here! — minghua, May 24 '13 at 19:59

minghua · Answer 1 · 2013-06-03T17:20:53.480

I guess this is the answer (incomplete, to be tested):

There is a problem when interrupts are enabled too early. do_alignment() calls __get_user() on the assumption that it runs in atomic context, but with interrupts already enabled that assumption no longer holds. If interrupt enabling is deferred until after that point, everything should be OK.

Please look at two kernel commits. The first, from Jun 25 2011, defers interrupt enabling. The second, from Feb 25 2013, replaces uses of __get_user() with probe_kernel_address().

The first commit:

The 3.x kernel removed interrupt-enabling in low-level handlers __dabt_svc and __dabt_user etc. The commit message:

git diff 8b418616..02fe2845 entry-armv.S
commit 02fe2845d6a837ab02f0738f6cf4591a02cc88d4
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sat Jun 25 11:44:06 2011 +0100

    ARM: entry: avoid enabling interrupts in prefetch/data abort handlers

    Avoid enabling interrupts if the parent context had interrupts enabled
    in the abort handler assembly code, and move this into the breakpoint/
    page/alignment fault handlers instead.

    This gets rid of some special-casing for the breakpoint fault handlers
    from the low level abort handler path.

    Acked-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

commit 8b4186160b7894ca4583f702a562856d5d9e9118
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sat Jun 25 19:25:02 2011 +0100

And the code diff snippet:

diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S
index d644d02..c46bafa 100644
--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -185,20 +185,15 @@ ENDPROC(__und_invalid)
 __dabt_svc:
        svc_entry
... ...
        dabt_helper

        @
-       @ set desired IRQ state, then call main handler
+       @ call main handler
        @
-       debug_entry r1
-       msr     cpsr_c, r9
        mov     r2, sp
        bl      do_DataAbort
......

That confirms interrupts do not need to be enabled too early in fault handlers.

The second commit:

commit b255188f90e2bade1bd11a986dd1ca4861869f4d
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Mon Feb 25 16:10:42 2013 +0000

    ARM: fix scheduling while atomic warning in alignment handling code

    Paolo Pisati reports that IPv6 triggers this warning:

    BUG: scheduling while atomic: swapper/0/0/0x40000100
    [<c001b1c4>] (unwind_backtrace+0x0/0xf0) from [<c0503c5c>] (__schedule_bug+0x48/0x5c)
    [<c0503c5c>] (__schedule_bug+0x48/0x5c) from [<c0508608>] (__schedule+0x700/0x740)
    [<c0508608>] (__schedule+0x700/0x740) from [<c007007c>] (__cond_resched+0x24/0x34)
    [<c007007c>] (__cond_resched+0x24/0x34) from [<c05086dc>] (_cond_resched+0x3c/0x44)
    [<c05086dc>] (_cond_resched+0x3c/0x44) from [<c0021f6c>] (do_alignment+0x178/0x78c)
    [<c0021f6c>] (do_alignment+0x178/0x78c) from [<c00083e0>] (do_DataAbort+0x34/0x98)
    [<c00083e0>] (do_DataAbort+0x34/0x98) from [<c0509a60>] (__dabt_svc+0x40/0x60)
    Exception stack(0xc0763d70 to 0xc0763db8)
    [<c0509a60>] (__dabt_svc+0x40/0x60) from [<c02a8490>] (__csum_ipv6_magic+0x8/0xc8)

Fix this by using probe_kernel_address() stead of __get_user().
 arch/arm/mm/alignment.c |   11 ++++-------
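
To see why the swap helps, here is a small user-space model, not kernel code. Every name ending in `_model` is made up for illustration. The assumption, based on my reading of include/linux/uaccess.h, is that probe_kernel_address() wraps __copy_from_user_inatomic() in pagefault_disable()/pagefault_enable() and so never tries to reschedule, while the __get_user() path can call into the scheduler, as the backtrace above shows.

```c
/* Model state: > 0 means "atomic context, must not schedule". */
static int preempt_count_model;
/* Counts how often the model would have hit "scheduling while atomic". */
static int sched_while_atomic_model;

/* Models the reschedule point reached from __get_user() in the backtrace. */
static void cond_resched_model(void)
{
    if (preempt_count_model > 0)
        sched_while_atomic_model++;
}

/* Models __get_user(): may reschedule before doing the access. */
static int get_user_model(const unsigned long *addr, unsigned long *out)
{
    cond_resched_model();
    *out = *addr;
    return 0;
}

/* Models probe_kernel_address(): an atomic copy with no reschedule point. */
static int probe_kernel_address_model(const unsigned long *addr,
                                      unsigned long *out)
{
    preempt_count_model++;      /* stands in for pagefault_disable() */
    *out = *addr;               /* stands in for __copy_from_user_inatomic() */
    preempt_count_model--;      /* stands in for pagefault_enable() */
    return 0;
}
```

With `preempt_count_model` set to 1 to mimic an alignment trap taken in softirq context, `get_user_model()` bumps the bug counter and `probe_kernel_address_model()` does not.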

See my exploration in the edits to the question. That's about it. Since the problem is only seen with heavy network traffic, throttling the network load at the very first stage of packet receiving would avoid the problem. So far that is a valid workaround. What do you think? It could be the DMA engine, as the vendor released a patch that avoids using DMA on NAND, and the network uses DMA. — minghua, Jan 24 '15 at 04:01
