
My Raspberry Pi 4B dies every time it does something non-trivial (for example, when a backup job starts). I'm running Arch Linux (armv7l) on it. Memory usage is always below 15%.

Below is the log, including the output of free -hw logged 7 seconds before the OOM.

net-restart.sh is a simple bash script. The most complicated thing it does is ping, so there's no reason for it to cause an OOM when there's more than 3 GiB free. Sometimes the OOM is triggered by the PostgreSQL vacuum service, sometimes by an rsync-based backup. Once the OOM killer starts, it just kills one process after another until the system dies completely.
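
For reference, net-restart.sh is roughly along these lines (a simplified sketch only; the gateway address and the restarted unit are placeholders, not the real values):

    #!/usr/bin/env bash
    # Simplified sketch of net-restart.sh: ping the gateway and restart
    # networking if it is unreachable. All values here are placeholders.
    GATEWAY="192.168.1.1"                      # placeholder address
    if ! ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
        systemctl restart systemd-networkd     # placeholder unit
    fi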

I have upgraded the kernel (and other packages) a few times since this started happening, and there was no software change before it started. Could it be a hardware problem?

Btw, I have also tried adding swap (2 GiB), but it didn't help.
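
(For completeness, adding a 2 GiB swap file on Arch goes roughly like this; the path is just an example:)

    # create and enable a 2 GiB swap file (example path)
    dd if=/dev/zero of=/swapfile bs=1M count=2048 status=progress
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap defaults 0 0' >> /etc/fstab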

23:00:02 free[10890]:                total        used        free      shared     buffers       cache   available
23:00:02 free[10890]: Mem:           3,7Gi        82Mi       3,2Gi       2,0Mi       0,0Ki       442Mi       3,6Gi
23:00:02 free[10890]: Swap:             0B          0B          0B

23:00:09 kernel: oom_kill_process: 13 callbacks suppressed
23:00:09 kernel: net-restart.sh invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
23:00:09 kernel: CPU: 2 PID: 10992 Comm: net-restart.sh Tainted: G         C         6.1.14-1-rpi-ARCH #1
23:00:09 kernel: Hardware name: BCM2711
23:00:09 kernel:  unwind_backtrace from show_stack+0x18/0x1c
23:00:09 kernel:  show_stack from dump_stack_lvl+0x90/0xac
23:00:09 kernel:  dump_stack_lvl from dump_header+0x54/0x1fc
23:00:09 kernel:  dump_header from oom_kill_process+0x23c/0x248
23:00:09 kernel:  oom_kill_process from out_of_memory+0x218/0x34c
23:00:09 kernel:  out_of_memory from __alloc_pages+0xa98/0x1044
23:00:09 kernel:  __alloc_pages from __pmd_alloc+0x3c/0x1d8
23:00:09 kernel:  __pmd_alloc from copy_page_range+0xcac/0xcc4
23:00:09 kernel:  copy_page_range from dup_mm+0x440/0x5a4
23:00:09 kernel:  dup_mm from copy_process+0xda0/0x164c
23:00:09 kernel:  copy_process from kernel_clone+0xac/0x3a8
23:00:09 kernel:  kernel_clone from sys_clone+0x78/0x9c
23:00:09 kernel:  sys_clone from ret_fast_syscall+0x0/0x1c
23:00:09 kernel: Exception stack(0xf08b1fa8 to 0xf08b1ff0)
23:00:09 kernel: 1fa0:                   b6fd0088 00000001 01200011 00000000 00000000 00000000
23:00:09 kernel: 1fc0: b6fd0088 00000001 b6efae58 00000078 bea210fc 0055d2bc bea2107c 005844e0
23:00:09 kernel: 1fe0: b6fd05a0 bea20f08 b6e2d260 b6e2d684
23:00:09 kernel: Mem-Info:
23:00:09 kernel: active_anon:7451 inactive_anon:603 isolated_anon:0
                                                 active_file:39567 inactive_file:70065 isolated_file:0
                                                 unevictable:0 dirty:143 writeback:0
                                                 slab_reclaimable:3166 slab_unreclaimable:6791
                                                 mapped:23163 shmem:594 pagetables:267
                                                 sec_pagetables:0 bounce:0
                                                 kernel_misc_reclaimable:0
                                                 free:848488 free_pcp:30 free_cma:80063
23:00:09 kernel: Node 0 active_anon:29804kB inactive_anon:2412kB active_file:158268kB inactive_file:280260kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:92652kB dirty:572kB writeback:0kB shmem:2376kB writeback_tmp:0kB kernel_stack:2360kB pagetables:1068kB sec_pagetab>
23:00:09 kernel: DMA free:323468kB boost:0kB min:3236kB low:4044kB high:4852kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:8076kB inactive_file:279068kB unevictable:0kB writepending:0kB present:786432kB managed:664228kB mlocked:0kB bounce:0kB free_pcp:120kB >
23:00:09 kernel: lowmem_reserve[]: 0 0 3188 3188
23:00:09 kernel: DMA: 143*4kB (UMEC) 119*8kB (UMEC) 68*16kB (UMEC) 23*32kB (UEC) 1*64kB (C) 1*128kB (C) 0*256kB 1*512kB (C) 0*1024kB 0*2048kB 78*4096kB (C) = 323540kB
23:00:09 kernel: 110236 total pagecache pages
23:00:09 kernel: 0 pages in swap cache
23:00:09 kernel: Free swap  = 0kB
23:00:09 kernel: Total swap = 0kB
23:00:09 kernel: 1012736 pages RAM
23:00:09 kernel: 816128 pages HighMem/MovableOnly
23:00:09 kernel: 30551 pages reserved
23:00:09 kernel: 81920 pages cma reserved
23:00:09 kernel: Tasks state (memory values in pages):
23:00:09 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
23:00:09 kernel: [    242]     0   242    12050     4296    98304        0          -250 systemd-journal
23:00:09 kernel: [    243]     0   243     7022     1837    61440        0         -1000 systemd-udevd
23:00:09 kernel: [    516]    81   516     2843     1047    49152        0          -900 dbus-daemon
23:00:09 kernel: [    550]     0   550     2422     1664    45056        0         -1000 sshd
23:00:09 kernel: [    554]     0   554   196576     7435   167936        0          -999 containerd
23:00:09 kernel: [    651]     0   651   203978    13307   245760        0          -500 dockerd
23:00:09 kernel: [  10882]   978 10882     4543     2764    65536        0             0 systemd-resolve
23:00:09 kernel: [  10888]     0 10888     1097      201    36864        0             0 agetty
23:00:09 kernel: [  10889]   977 10889     6022      965    65536        0             0 systemd-timesyn
23:00:09 kernel: [  10890]     0 10890     2676      341    49152        0             0 free
23:00:09 kernel: [  10897]     0 10897     3543     1493    57344        0             0 systemd-logind
23:00:09 kernel: [  10992]     0 10992     2169      824    40960        0             0 net-restart.sh
23:00:09 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=net-restart.service,mems_allowed=0,global_oom,task_memcg=/,task=systemd-resolve,pid=10882,uid=978
23:00:09 kernel: Out of memory: Killed process 10882 (systemd-resolve) total-vm:18172kB, anon-rss:1548kB, file-rss:9508kB, shmem-rss:0kB, UID:978 pgtables:64kB oom_score_adj:0

I've tried reducing the memory usage of my rsync backup, I've added a service that logs memory stats to see what's going on, and I've tried adding swap. Still puzzled.
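
The memory-stats logger mentioned above is nothing special; a minimal sketch of such a systemd timer/service pair looks like this (unit names and interval are illustrative):

    # /etc/systemd/system/memlog.service  (illustrative name)
    [Unit]
    Description=Log memory stats to the journal

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/free -hw

    # /etc/systemd/system/memlog.timer  (illustrative name)
    [Unit]
    Description=Run memlog.service periodically

    [Timer]
    OnBootSec=10s
    OnUnitActiveSec=10s

    [Install]
    WantedBy=timers.target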

jswcom
  • Did you try to disable memory overcommit? – dimich Mar 15 '23 at 23:15
  • I tried it now and it didn't help. :-( – jswcom Mar 17 '23 at 08:19
  • It's not for resolving the issue but for narrowing it down. With overcommit disabled, the OOM should happen when a process tries to allocate memory, not when it tries to use it (a minimal sketch follows these comments). – dimich Mar 17 '23 at 10:00
  • I see. Well, in any case, I wouldn't expect it to be triggered with more than 3 GiB available. Btw, it doesn't seem to be HW related, because today I found out that another of my Raspberries has the same issue. Same HW, same OS, same kernel - just different SW. – jswcom Mar 17 '23 at 10:52
  • @jswcom Did you revert back to the 32-bit kernel with `arm_64bit=0`? I just did that because I had trouble with the PHP/PDO driver (I was automatically updated to the 64-bit kernel). Everything worked, but now I'm getting those "xxx invoked oom-killer" messages too, with random programs. I have an 8 GB RPi4 with lots of memory free (5.9 GB free). I think there is something seriously wrong with the new armv7l 6.1.19-v7l+ kernel. (Another of my RPi4s still runs aarch64 and is fine, but I don't need PHP/PDO there.) I'll try reverting back to the 64-bit one for a moment to see if the problems go away. – Rik Mar 22 '23 at 10:56
  • Reverting back to the 6.1.19-v8+ aarch64 kernel seems to fix things. (Now I need a solution for the PHP/PDO "Bus error" problem :( ) Problem kernel for me is `6.1.19-v7l+ #1637 SMP Tue Mar 14 11:07:55 GMT 2023 armv7l`. The one that works is `6.1.19-v8+ #1637 SMP PREEMPT Tue Mar 14 11:11:47 GMT 2023 aarch64 GNU/Linux`. – Rik Mar 22 '23 at 11:11
  • @Rik, I'm running 32-bit all the time, so this was not an option for me. Someone mentioned somewhere that it was a kernel issue and that it worked fine with 5.x; however, I was not able to revert to it. The fun part was that I was getting OOMs when trying to build an AUR package with a legacy kernel. The good news is that it was fixed in 6.1.21-2. – jswcom Apr 05 '23 at 14:16
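
As suggested in the comments, disabling overcommit makes allocations fail at request time instead of waking the OOM killer on first use, which helps narrow down which process is really asking for the memory. A minimal sketch using the standard sysctl keys:

    # disable heuristic overcommit:
    # commit limit = swap + overcommit_ratio% of RAM
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=100
    # to persist across reboots, put the same keys in e.g.
    # /etc/sysctl.d/90-overcommit.conf:
    #   vm.overcommit_memory = 2
    #   vm.overcommit_ratio = 100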

2 Answers


I have had the same problem with rsync for about a month. (At first I just saw processes dying; it took a while to pinpoint rsync as the cause.) I also sometimes see a similar effect when calling 'du -hs'. The rsync runs from the SD card to a USB3 drive. Both drives seem healthy.

 rsync invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
Mar 19 04:08:56 rasp4 kernel: CPU: 1 PID: 1164 Comm: rsync Tainted: G         C         6.1.19-2-rpi-ARCH #1
Mar 19 04:08:56 rasp4 kernel: Hardware name: BCM2711
Mar 19 04:08:56 rasp4 kernel:  unwind_backtrace from show_stack+0x18/0x1c
Mar 19 04:08:56 rasp4 kernel:  show_stack from dump_stack_lvl+0x90/0xac
Mar 19 04:08:56 rasp4 kernel:  dump_stack_lvl from dump_header+0x54/0x1fc
Mar 19 04:08:56 rasp4 kernel:  dump_header from oom_kill_process+0x23c/0x248
Mar 19 04:08:56 rasp4 kernel:  oom_kill_process from out_of_memory+0x218/0x34c
Mar 19 04:08:56 rasp4 kernel:  out_of_memory from __alloc_pages+0xa98/0x1044
Mar 19 04:08:56 rasp4 kernel:  __alloc_pages from new_slab+0x384/0x43c
Mar 19 04:08:56 rasp4 kernel:  new_slab from ___slab_alloc+0x3e8/0xa0c
Mar 19 04:08:56 rasp4 kernel:  ___slab_alloc from kmem_cache_alloc_lru+0x4fc/0x640
Mar 19 04:08:56 rasp4 kernel:  kmem_cache_alloc_lru from __d_alloc+0x2c/0x1bc
Mar 19 04:08:56 rasp4 kernel:  __d_alloc from d_alloc+0x18/0x74
Mar 19 04:08:56 rasp4 kernel:  d_alloc from d_alloc_parallel+0x50/0x3b8
Mar 19 04:08:56 rasp4 kernel:  d_alloc_parallel from __lookup_slow+0x60/0x138
Mar 19 04:08:56 rasp4 kernel:  __lookup_slow from walk_component+0xf4/0x164
Mar 19 04:08:56 rasp4 kernel:  walk_component from path_lookupat+0x7c/0x1a4
Mar 19 04:08:56 rasp4 kernel:  path_lookupat from filename_lookup+0xc0/0x190
Mar 19 04:08:56 rasp4 kernel:  filename_lookup from vfs_statx+0x7c/0x168
Mar 19 04:08:56 rasp4 kernel:  vfs_statx from do_statx+0x70/0xb0

According to https://archlinuxarm.org/forum/viewtopic.php?f=23&t=16377, this issue has been solved (or rather bypassed) in 6.1.21-2.

I finally got some time to test it (with the current version, 6.1.21-3), and it seems to work fine.
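
To verify the fix is in place (assuming the stock Arch Linux ARM linux-rpi kernel package):

    # check the running kernel and the installed package version
    uname -r              # should report 6.1.21-2-rpi-ARCH or newer
    pacman -Q linux-rpi
    # upgrade if still on an affected version
    pacman -Syu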

jswcom