
My child process accesses a PCI address space. It works fine most of the time.

Sometimes, however, the child process ends up in the zombie state, and the dmesg log shows the following bus error.

[  501.134156] Caused by (from MCSR=10008): Bus - Read Data Bus Error
[  501.134169] Oops: Machine check, sig: 7 [#1]

There is no core file generated in this case.

[Linux:/]$ ps -axl | grep test1
4     0  6805 32495  20   0      0     0 exit   Zl   ?  0:05 [test1] <defunct>
[Linux:/]$ 

A core file is generated when the same child process dies from SIGSEGV, so I assume this has nothing to do with permissions or ulimit settings.

Can someone help me understand why no core file is generated in this case?

Child Process:
--------------

[Linux:/]$ cat /proc/6805/status
Name:   test1
State:  Z (zombie)
Tgid:   6805
Pid:    6805
PPid:   32495
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 0
Groups: 
Threads:    2
SigQ:   18/13007
SigPnd: 0000000002000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001006
SigCgt: 0000000182000200
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp:    0
Cpus_allowed:   3
Cpus_allowed_list:  0-1
voluntary_ctxt_switches:    8998
nonvoluntary_ctxt_switches: 857

   Stack:
   -------

[Linux:/]$ cat /proc/6805/stack
[<00000000>]    (nil)
[<c0008640>] __switch_to+0xc0/0x160
[<c004b4f4>] do_exit+0x5d4/0xa70
[<c000c694>] die+0x224/0x310
[<c000ce44>] machine_check_exception+0x124/0x1e0
[<c00123bc>] ret_from_mcheck_exc+0x0/0x14c
[Linux:/]$ 


Parent Process:
---------------
[Linux:/]$ cat /proc/32495/status
Name:   test
State:  S (sleeping)
Tgid:   32495
Pid:    32495
PPid:   21911
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 256
Groups: 
VmPeak:     4820 kB
VmSize:     4820 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      2548 kB
VmRSS:      2548 kB
VmData:     1284 kB
VmStk:       132 kB
VmExe:       900 kB
VmLib:      1976 kB
VmPTE:        24 kB
VmSwap:        0 kB
Threads:    1
SigQ:   19/13007
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000010000
SigIgn: 0000000000001006
SigCgt: 0000000043816ef9
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp:    0
Cpus_allowed:   3
Cpus_allowed_list:  0-1
voluntary_ctxt_switches:    274
nonvoluntary_ctxt_switches: 145
[Linux:/]$ 
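The signal masks in the status dumps above are hexadecimal bitmaps in which bit (n - 1) stands for signal number n. A minimal sketch for decoding them follows; the helper name `decode_sigmask` is made up for illustration, and SIGBUS is assumed to be signal 7, which matches the `sig: 7` in the oops line.

```python
def decode_sigmask(mask_hex):
    """Return the signal numbers set in a /proc/<pid>/status mask field."""
    mask = int(mask_hex, 16)
    # Bit (n - 1) of the mask corresponds to signal number n.
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


# Mask values copied from the child's status dump (PID 6805).
child_masks = {
    "SigBlk": "0000000000000000",
    "SigIgn": "0000000000001006",
    "SigCgt": "0000000182000200",
}

SIGBUS = 7  # assumption: Linux numbering on this platform, as in "sig: 7"

if __name__ == "__main__":
    for field, value in child_masks.items():
        signals = decode_sigmask(value)
        state = "contains" if SIGBUS in signals else "does not contain"
        print(f"{field}: {signals} ({state} SIGBUS)")
```

None of the three masks contains signal 7, so the child neither blocks, ignores, nor catches SIGBUS. That is consistent with the assumption that the missing core file is not caused by the process's own signal setup.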
Siva Kumar
  • I'm assuming you've checked your code to see if you're intentionally/accidentally exiting after a read failure. Assuming the parent is still alive, can you wait on the child and read the exit status and return code? – Ram Jan 31 '17 at 02:41
  • The parent process is a shell script that launches the child process and waits on its PID. – Siva Kumar Jan 31 '17 at 07:44
  • The parent process is not aware of the child's SIGBUS crash and is still waiting on its PID. The child process gets SIGBUS when it tries to read from one of the PCI device registers. I am not exiting the child process myself; it goes into the zombie state as soon as the read failure happens. – Siva Kumar Jan 31 '17 at 07:59
  • I understand that the PCI hardware which is mmapped to that address is not responding. So it is appropriate for the kernel (and only the kernel) to deal with such errors. They are not propagated to user level, because they are not software faults. We do not get a core dump (either kernel or user space) since it is not a software failure. – Siva Kumar Feb 18 '17 at 17:53

1 Answer


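The PCI hardware behind the mmapped address did not respond to the read, so the CPU raised a machine check exception. This is a hardware fault, and it is appropriate that only the kernel deals with it.

The error is not propagated to user level, because it is not a software fault. The stack trace in the question shows machine_check_exception() calling die(), which in turn calls do_exit(), so the task is terminated inside the kernel. Core files are written on the normal signal delivery path, which is never reached here, so we do not get a core dump (either kernel or user space).

The machine check exception handler in the kernel reports what the hardware failure was and, depending on the cause, which address or data is relevant (here MCSR=10008, "Bus - Read Data Bus Error"). The root cause needs to be investigated further from the hardware side.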
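As suggested in the comments on the question, a parent can inspect the child's wait status to tell a normal signal death (where a core file may be written) from other exits. The sketch below is a Python stand-in for the shell-script parent; `describe_wait_status` and `run_child_and_report` are hypothetical names, and the child simply sends itself SIGBUS to imitate a user-space bus error. It only helps once the child can actually be reaped, which had not happened in the case reported here.

```python
import os
import signal


def describe_wait_status(status):
    """Turn a raw waitpid() status into a short human-readable string."""
    if os.WIFSIGNALED(status):
        sig = os.WTERMSIG(status)
        # WCOREDUMP is only meaningful when the child was killed by a signal.
        core = "core dumped" if os.WCOREDUMP(status) else "no core"
        return f"killed by signal {sig} ({core})"
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    return f"unrecognised status {status}"


def run_child_and_report(child_fn):
    """Fork, run child_fn in the child, and describe how the child ended."""
    pid = os.fork()
    if pid == 0:
        # Child: never return into the parent's code path.
        try:
            child_fn()
            os._exit(0)
        except BaseException:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return describe_wait_status(status)


if __name__ == "__main__":
    # Imitate a bus error delivered through the ordinary signal path.
    print(run_child_and_report(lambda: os.kill(os.getpid(), signal.SIGBUS)))
```

Whether the "core dumped" flag is set depends on the core size limit and core pattern of the system running the sketch, so the output varies between machines.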

Siva Kumar