gdb back-trace doesn't show all the function call stacks for linux-5.10.0 or linux-5.10.122, why?

Question

Something really strange is happening here: I can't see the full stack trace with the 'bt' command in gdb. I tried with fresh linux-5.10.122 and qemu-6.2.0 sources and it happens there too! (It does not happen with linux-5.4.21 with defconfig, with either qemu 5.1.0 or 6.2.0.)

I would be grateful if somebody could check if this happens to other people or just me.

download the linux-5.10.122 tarball from https://www.kernel.org/
uncompress it, set the env variables ARCH=arm64 and CROSS_COMPILE=aarch64-none-elf-, then do "make defconfig" and "make -j$(nproc) Image"
download qemu-6.2.0 from https://www.qemu.org/
uncompress it and do "mkdir build", "cd build", "../configure --target-list=aarch64-softmmu --enable-debug" and "make"
run qemu and wait for the debugger to attach:
qemu-6.2.0/build/aarch64-softmmu/qemu-system-aarch64 -machine virt,gic-version=max,secure=off,virtualization=true -cpu max -kernel linux-5.10.112/arch/arm64/boot/Image -m 2G -nographic -netdev user,id=vnet,hostfwd=:127.0.0.1:0-:22,tftp=/srv/tftp -device virtio-net-pci,netdev=vnet -machine iommu=smmuv3 --append "root=/dev/ram init=/init nokaslr earlycon ip=dhcp hugepages=16" -s -S
run the debugger: do "aarch64-none-elf-gdb linux-5.10.112/vmlinux -x gdb_script", where gdb_script contains these four lines:
target remote :1234
layout src
b start_kernel
b __driver_attach

Now, in gdb, when you press 'c' twice, it'll stop at the first __driver_attach (the first 'c' stops at start_kernel). When you are at __driver_attach, type 'bt' and see if you get the full function stack trace.
This is what I see.

(gdb) bt
#0  __driver_attach (dev=0xffff000002582810, data=0xffff800011dc2358 <dummy_regulator_driver+40>)
    at drivers/base/dd.c:1060
#1  0xffff8000107a3ed0 in bus_for_each_dev (bus=<optimized out>, start=<optimized out>,
    data=0xffff800011dc2358 <dummy_regulator_driver+40>, fn=0xffff8000107a6f60 <__driver_attach>)
    at drivers/base/bus.c:305
#2  0xd6d78000107a5c58 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I used to see more than 20 stack frames but strangely I see only two. I can still see many stacks for linux-5.4.21 that I was working with in the past.
Could anyone check if this happens to anyone else too? Even though I can't see the whole stack frames, I think if I add BLK_DEV_RAM and set initramfs.cpio.gz in the linux build, the kernel will boot ok to the shell prompt. So linux is running ok but only the gdb can't show the stack levels.

My OS : ubuntu-20.04 5.13.0-35-generic

$ aarch64-none-elf-gdb --version
GNU gdb (GNU Toolchain for the A-profile Architecture 10.2-2020.11 (arm-10.16)) 10.1.90.20201028-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

It looks like, as the kernel version increases, at some point there is a problem with the gdb 'bt' command?

ADD

I found that CONFIG_DEBUG_FRAME_POINTER and CONFIG_DEBUG_INFO are already set by default. I also tried adding CONFIG_DEBUG_KERNEL, CONFIG_KGDB, CONFIG_GDB_SCRIPTS and CONFIG_STACKTRACE, all to no avail. I need to get this working for the arm64 qemu virt machine.

ADD2 01:10 4/26/2022 UTC

At another breakpoint hit at __driver_attach, I found this:

(gdb) bt
#0  __driver_attach (dev=dev@entry=0xffff0000401d1810, data=data@entry=0xffff800011bbbbb8 <mxc_gpio_driver+40>) at drivers/base/dd.c:1046
#1  0xffff8000107684f8 in bus_for_each_dev (bus=0xffff800011cba910 <platform_bus_type>, start=0x0, data=0xffff800011bbbbb8 <mxc_gpio_driver+40>, fn=0xffff80001076b860 <__driver_attach>) at drivers/base/bus.c:307
#2  0xb8cd80001076a594 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) x/5g $sp
0xffff800011dcbcc0: 0xffff800011dcbd20  0xb8cd80001076a594
0xffff800011dcbcd0: 0xffff80001076b860  0xffff800011bbbbb8
0xffff800011dcbce0: 0x0000000000000000

Because this is right after the pc reached the function __driver_attach, the sp has not yet been updated and is still that of the previous function (bus_for_each_dev). The first two values at $sp are supposed to be the fp and lr saved by bus_for_each_dev (see "understanding aarch64 assembly function call, how is stack operated": arm64 stores the previous function's fp and lr at the bottom of the new stack frame as it enters a function). The lr (link register, the address to return to after bus_for_each_dev finishes) is 0xb8cd80001076a594, which is weird (not a kernel address). The following 3 values are the function arguments of bus_for_each_dev and they look correct.
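To make the "not a kernel address" observation concrete, here is a small Python sketch that checks each word of the `x/5g $sp` dump. It is only an illustration: the helper name `looks_like_kernel_address` is made up, and it assumes that kernel addresses in this configuration always have 0xffff in the top 16 bits, as the other values in the dump suggest.

```python
# Values copied from the `x/5g $sp` dump above.
stack_words = [
    0xffff800011dcbd20,  # saved fp (x29)
    0xb8cd80001076a594,  # saved lr (x30)
    0xffff80001076b860,  # fn argument
    0xffff800011bbbbb8,  # data argument
    0x0000000000000000,  # start argument (NULL)
]

def looks_like_kernel_address(value):
    """Assumption: a kernel address has 0xffff in bits 63..48."""
    return (value >> 48) == 0xffff

for word in stack_words:
    if looks_like_kernel_address(word):
        status = "kernel address"
    else:
        status = "not a kernel address"
    print(f"{word:#018x}: {status}")
```

Only the saved lr and the NULL `start` argument fail the check, and NULL is expected there.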

ADD (08:20 27/04/2022 UTC)

I tried to break at driver_attach. It calls bus_for_each_dev, and bus_for_each_dev calls __driver_attach. When I entered bus_for_each_dev, I checked the assembly code. It stores x29 and x30 with a pre-decrement of sp (stp x29, x30, [sp, #-80]!), so I checked the values of x29 (fp) and x30 (lr). They were 0xffff800011efbd20 and 0xffff8000107a52f8 respectively. Those values were to be placed at the bottom of the stack frame of bus_for_each_dev. Then, from inside bus_for_each_dev, I entered __driver_attach. At this point I checked the two values at $sp (the sp value is still that of bus_for_each_dev). They were 0xffff800011efbd20 (correct) and 0xc9a48000107a52f8 (wrong!). Why did the upper 16 bits change??
I soon found that when x29 and x30 are written at the bottom of the new stack frame, the upper 16 bits of x30 are already written with wrong values in the first place. If I fix these 16 bits to the correct value (usually 0xffff, because the top bits of kernel addresses are 0xffff), the bt output shows more frames. The more saved x30 values I fix, the more stack frames I can see. I have filed a bug at bugs.linaro.org so that an expert can check this.
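The manual fix of the upper 16 bits can be written down as a short Python sketch. This is not what gdb or the kernel does internally; `strip_pac` is a hypothetical helper, and it assumes 48-bit virtual addresses where bit 55 tells whether the address belongs to the upper (kernel) half. With a different VA size the number of bits to restore would differ.

```python
MASK_64 = 0xffffffffffffffff

def strip_pac(addr, va_bits=48):
    """Overwrite the bits above va_bits with copies of bit 55.

    Assumption: 48-bit virtual addresses, bit 55 set for kernel addresses.
    """
    low_mask = (1 << va_bits) - 1
    if addr & (1 << 55):
        # Upper (kernel) half: the high bits should all be ones.
        return (addr & low_mask) | (~low_mask & MASK_64)
    # Lower (user) half: the high bits should all be zeros.
    return addr & low_mask

# Saved lr values that appear in the gdb output of this question.
for saved_lr in (0xd6d78000107a5c58, 0xb8cd80001076a594, 0xc9a48000107a52f8):
    print(f"{saved_lr:#018x} -> {strip_pac(saved_lr):#018x}")
```

For 0xc9a48000107a52f8 this gives back 0xffff8000107a52f8, which is the x30 value seen in the register before it was stored.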

The magic seems to be in `postcore_initcall(dma_atomic_pool_init);` which calls it during init. — stark, Apr 21 '22 at 13:13
Hi, thanks. but that is not a point where a crash happened but it was one of the breakpoints. And if I let the program run (on qemu machine), it boots ok until the shell prompt. I tried with the original 5.10.0-rc5 kernel code but it's the same. maybe I'll check from where this happens tomorrow. probably gdb setting problem? — Chan Kim, Apr 21 '22 at 14:19
if I do the same thing with kernel 5.4.21, everything works normal. But with the newer 5.10.0-rc5 I have this problem. Of course .config files are different. (I tried doing `make oldconfig` from the old .config from 5.4.21, but the same..) — Chan Kim, Apr 22 '22 at 03:03
Please do not edit solution announcements into the question. Accept (i.e. click the "tick" next to it) one of the existing answer, if there are any. You can also create your own answer, and even accept it, if your solution is not yet covered by an existing answer. Compare https://stackoverflow.com/help/self-answer — Yunnosch, Apr 28 '22 at 06:29
Especially since the existing answer does not seem to really solve the problem, only avoid it. — Yunnosch, Apr 28 '22 at 06:45
@Yunnosch I read this pointer authentication feature (hardware feature added from arm v8.3 architecture) is to defend the system from return address attack. And with defconfig, this feature is used in modern arm64 processors and the hardware alters the return address (lr register) in the stack. so I think my answer is the only solution though I said it 'avoids' the problem. I don't think gdb can support this de-authentication in itself, it's overdesign. Curious if someone will come up with a better answer. — Chan Kim, Apr 28 '22 at 08:30

score 0 · Accepted Answer · answered Apr 27 '22 at 09:21

I just found out that by turning CONFIG_ARM64_PTR_AUTH off when building linux, I can avoid this problem. (Pointer authentication is an armv8.3 feature; I noticed the instruction 'pacia' at the start of the function's assembly code.) (I asked the kernelnewbies and qemu-discuss email lists but experts don't respond often..) Hope this is helpful to someone later.
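If you want to check whether the option is still on in a given build, something like the following Python sketch should do. The function name `ptr_auth_enabled` and the sample config strings are made up for illustration; it only looks for the exact line `CONFIG_ARM64_PTR_AUTH=y`.

```python
def ptr_auth_enabled(config_text):
    """Return True if the .config text contains CONFIG_ARM64_PTR_AUTH=y."""
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_ARM64_PTR_AUTH=y":
            return True
    return False

# Hypothetical .config fragments, not taken from a real build.
sample_on = "CONFIG_ARM64=y\nCONFIG_ARM64_PTR_AUTH=y\n"
sample_off = "CONFIG_ARM64=y\n# CONFIG_ARM64_PTR_AUTH is not set\n"

print(ptr_auth_enabled(sample_on))
print(ptr_auth_enabled(sample_off))
```

For a real tree you would pass the contents of the .config file in the kernel build directory instead of the sample strings.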
