Most of the user space processes are ending up in D-state after the unit runs for around 3-4 days, the unit is running on ARM processor. From the top o/p we can see that processes that are in D-state are waiting on system calls "page_fault" and "squashfs_readpage". Utimately this leads to a watchdog reset. The processes that go into D-sate would take unusually long time to recover.
Following is the top o/p when the system ends up in trouble:
top - 12:00:11 up 3 days, 2:40, 3 users, load average: 2.77, 1.90, 1.72
Tasks: 250 total, 3 running, 238 sleeping, 0 stopped, 9 zombie
Cpu(s): 10.0% us, 75.5% sy, 0.0% ni, 0.0% id, 10.3% wa, 0.0% hi, 4.2% si
Mem: 191324k total, 188896k used, 2428k free, 2548k buffers
Swap: 0k total, 0k used, 0k free, 87920k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1003 root 20 0 225m 31m 4044 S 15.2 16.7 0:21.91 user_process_1
3745 root 20 0 80776 9476 3196 **D** 9.0 5.0 1:31.79 user_process_2
129 root 15 -5 0 0 0 S 7.4 0.0 0:27.65 **mtdblockd**
4624 root 20 0 3640 256 160 **D** 6.5 0.1 0:00.20 GetCounters_cus
3 root 15 -5 0 0 0 S 3.2 0.0 43:38.73 ksoftirqd/0
31363 root 20 0 2356 1176 792 R 2.6 0.6 40:09.58 top
347 root 30 10 0 0 0 S 1.9 0.0 28:56.04 **jffs2_gcd_mtd3**
1169 root 20 0 225m 31m 4044 S 1.9 16.7 39:31.36 user_process_1
604 root 20 0 0 0 0 S 1.6 0.0 27:22.76 user_process_3
1069 root -23 0 225m 31m 4044 S 1.3 16.7 20:45.39 user_process_1
4545 root 20 0 3640 564 468 S 1.0 0.3 0:00.08 GetCounters_cus
64 root 15 -5 0 0 0 **D** 0.3 0.0 0:00.83 **kswapd0**
969 root 20 0 20780 1856 1376 S 0.3 1.0 14:18.89 user_process_4
973 root 20 0 225m 31m 4044 S 0.3 16.7 3:35.74 user_process_1
1070 root -23 0 225m 31m 4044 S 0.3 16.7 16:41.04 user_process_1
1151 root -81 0 225m 31m 4044 S 0.3 16.7 23:13.05 user_process_1
1152 root -99 0 225m 31m 4044 S 0.3 16.7 8:48.47 user_process_1
One more interesting observation is that when the system lands up in this problem, we can consistently see "mtdblockd" process running in the top o/p. We have swap disabled on this unit. there is no apparent memory leak in the unit.
Any idea what could be the possible reasons, the processes are stuck in D-sates?