
I have the following stack trace. Is it possible to make out anything useful from this for debugging?

Program received signal SIGSEGV, Segmentation fault.
0x00000002 in ?? ()
(gdb) bt
#0  0x00000002 in ?? ()
#1  0x00000001 in ?? ()
#2  0xbffff284 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) 

Where should I start looking in the code when I get a segmentation fault and the stack trace is this unhelpful?

NOTE: If I post the code, then the SO experts will give me the answer. I want to take the guidance from SO and find the answer myself, so I'm not posting the code here. Apologies.

Sangeeth Saravanaraj
  • Probably your program jumped off into the weeds - can you recover anything from the stack pointer? – Carl Norum Mar 21 '12 at 17:36
  • Another thing to consider is whether the frame pointer is set correctly. Are you building without optimizations or passing a flag like `-fno-omit-frame-pointer`? Also, for memory corruption, `valgrind` might be a more appropriate tool, if it's an option for you. – FatalError Mar 21 '12 at 17:36

6 Answers


Those bogus addresses (0x00000002 and the like) are actually PC values, not SP values. Now, when you get this kind of SEGV, with a bogus (very small) PC address, 99% of the time it's due to calling through a bogus function pointer. Note that virtual calls in C++ are implemented via function pointers, so any problem with a virtual call can manifest in the same way.
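As a hedged illustration of that failure mode, here is a minimal C sketch of how a clobbered function pointer can end up as a tiny PC value. The names `struct device`, `real_handler`, `handler_intact`, and `corrupt` are hypothetical and not taken from the asker's program, and converting an integer to a function pointer is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handler type and object; not from the original program. */
typedef int (*handler_fn)(int);

struct device {
    char       name[8];
    handler_fn on_event;   /* called indirectly, much like a C++ virtual call */
};

static int real_handler(int x)
{
    return x + 1;
}

/* Reports whether the pointer still refers to the intended function. */
static int handler_intact(const struct device *d)
{
    assert(d != NULL);
    return d->on_event == real_handler;
}

/* Simulates memory corruption: the pointer now holds the small integer 2. */
static void corrupt(struct device *d)
{
    assert(d != NULL);
    d->on_event = (handler_fn)(uintptr_t)2;
}
```

Calling `d.on_event(1)` after `corrupt(&d)` would, on a typical x86 Linux build, be expected to fault with the PC at 0x2 while the return address of the real caller is still sitting on top of the stack.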

An indirect call instruction just pushes the PC after the call onto the stack and then sets the PC to the target value (bogus in this case), so if this is what happened, you can easily undo it by manually popping the PC off the stack. In 32-bit x86 code you just do:

(gdb) set $pc = *(void **)$esp
(gdb) set $esp = $esp + 4

With 64-bit x86 code you need:

(gdb) set $pc = *(void **)$rsp
(gdb) set $rsp = $rsp + 8

Then you should be able to do a `bt` and figure out where the code really is.

The other 1% of the time, the error will be due to overwriting the stack, usually by overflowing an array stored on the stack. In this case, you might be able to get more clarity on the situation by using a tool like valgrind.
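For illustration only, a minimal C sketch of the kind of stack array overflow meant here, next to a bounded variant. Both function names are hypothetical. Many toolchains enable stack protection by default, in which case an overflow is likely to abort the program instead of producing a bogus PC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical routine with an unchecked copy into a stack array. */
static size_t copy_name_unchecked(const char *input)
{
    char name[8];             /* lives near the saved return address */
    assert(input != NULL);
    strcpy(name, input);      /* BUG: no bound; 8 or more characters overflow */
    return strlen(name);
}

/* Bounded variant: truncates instead of writing past the array. */
static size_t copy_name_checked(const char *input)
{
    char name[8];
    assert(input != NULL);
    strncpy(name, input, sizeof name - 1);
    name[sizeof name - 1] = '\0';   /* strncpy does not always terminate */
    return strlen(name);
}
```

With short input both behave the same; with long input the unchecked version writes past `name`, and the damage often shows up only when the function returns.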

Chris Dodd
  • Is there a way to get the bt when you are not running the program, you just have a core dump? – George Feb 05 '14 at 17:26
  • @George: `gdb executable corefile` will open gdb with the executable and core file, at which point you can do `bt` (or the above commands followed by `bt`)... – Chris Dodd Mar 27 '14 at 18:58
  • @ChrisDodd I believe I tried that then and couldn't get the bt. Unfortunately it's been a while since then and I don't have that corefile anymore. Thank you anyhow. – George Mar 31 '14 at 09:54
  • When I try to change it, I get the following error: "Attempt to assign to an unmodifiable value." The command was `set $pc = *(void **)$sp`. This is on 64-bit ARM. – Sandeep Apr 17 '15 at 12:10
  • @mk.. ARM doesn't use the stack for return addresses -- it uses the link register instead. So it generally doesn't have this problem, or if it does, it is usually due to some other stack corruption. – Chris Dodd Apr 17 '15 at 22:10
  • Thank you. Yes I remember it now. Do you have any suggestions wrt ARM for the same problem. If LR is holding invalid values. If you want me to post a new question for that, I will. Please let me know. – Sandeep Apr 20 '15 at 02:22
  • Even on ARM, I think all the general-purpose registers and LR are stored on the stack before the called function starts executing. Once the function finishes, the saved LR value is popped into PC and the function returns. So if the stack is corrupted, we can see a wrong value in PC, right? In that case, maybe adjusting the stack pointer will lead to the appropriate stack and help debug the issue. What do you think? Please let me know your thoughts. Thank you. – Sandeep Apr 20 '15 at 02:37
  • I've got "Cannot access memory at address 0xbe7fffe8" on the first command. – Vincent Nov 13 '15 at 20:55
  • What does "bogus" mean? – Danny Lo May 17 '17 at 11:12
  • On ARM, this attempt to cast the esp register triggers an "invalid cast" error. – unresolved_external Oct 01 '18 at 08:01
  • ARM is not x86 -- its stack pointer is called `sp`, not `esp` or `rsp`, and its call instruction stores the return address in the `lr` register, not on the stack. So for ARM, all you really need to undo the call is `set $pc = $lr`. If `$lr` is invalid, you have a much harder problem to unwind. – Chris Dodd Oct 02 '18 at 04:38
  • Unfortunately Valgrind does not seem to detect stack corruption. You might be able to use AddressSanitizer. – Timmmm Jul 08 '22 at 10:29

If the situation is fairly simple, Chris Dodd's answer is the best one. It does look like it jumped through a NULL pointer.

However, it is possible the program shot itself in the foot, knee, neck, and eye before crashing—overwrote the stack, messed up the frame pointer, and other evils. If so, then unraveling the hash is not likely to show you potatoes and meat.

A more efficient approach is to run the program under the debugger and step over functions until the program crashes. Once the crashing function is identified, start again, step into that function, and determine which of the functions it calls causes the crash. Repeat until you find the single offending line of code. 75% of the time, the fix will then be obvious.

In the other 25% of situations, the so-called offending line of code is a red herring. It will be reacting to (invalid) conditions set up many lines before, maybe thousands of lines before. If that is the case, the best course depends on many factors, mostly your understanding of the code and your experience with it:

  • Perhaps setting a debugger watchpoint or inserting diagnostic printfs on critical variables will lead to the necessary "Aha!"
  • Maybe changing test conditions with different inputs will provide more insight than debugging.
  • Maybe a second pair of eyes will force you to check your assumptions or gather overlooked evidence.
  • Sometimes, all it takes is going to dinner and thinking about the gathered evidence.

Good luck!

wallyk
  • If a second pair of eyes is not available, then rubber ducks are well proven as alternatives. – Matt Mar 21 '12 at 18:52
  • Writing off the end of a buffer can do it, too. It might not crash where you write off the end of the buffer, but when you step out of the function, then it dies. – phyatt Sep 23 '16 at 20:15
  • May be useful: [GDB: Automatic 'Next'ing](https://stackoverflow.com/questions/5812411/gdb-automatic-nexting/5813439#5813439) – user202729 Oct 12 '18 at 04:22

Assuming that the stack pointer is valid...

It may be impossible to know exactly where the SEGV occurred from the backtrace alone; I think the first two stack frames are completely overwritten. 0xbffff284 looks like a valid stack address, but the other two values (0x00000002 and 0x00000001) do not. For a closer look at the stack, you can try the following:

gdb$ x/32ga $rsp

or a variant (replace the 32 with another count). That prints 32 words of giant (g) size, 8 bytes each, starting at the stack pointer and formatted as addresses (a). On a 32-bit target, use w (4-byte words) and $esp instead. Type 'help x' for more info on the format.

Instrumenting your code with some sentinel printf calls may not be a bad idea in this case.
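One possible shape for such sentinels, as a sketch rather than a recommendation: a small helper that prints the file, line, and one value to stderr. The names `trace_value` and `TRACE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: prints one checkpoint line to stderr and returns
 * the number of characters written (negative on output error). */
static int trace_value(const char *file, int line, const char *label, long value)
{
    assert(file != NULL && label != NULL);
    /* stderr is not fully buffered, so the line is likely to be visible
     * even if the program crashes shortly afterwards. */
    return fprintf(stderr, "%s:%d: %s = %ld\n", file, line, label, value);
}

/* Logs an integer expression together with its source text and location. */
#define TRACE(var) trace_value(__FILE__, __LINE__, #var, (long)(var))
```

Sprinkling `TRACE(some_index);` before and after suspect calls narrows down the last point the program reached with sane values.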

manabear
  • Incredibly helpful, thank you -- I had a stack that only went back three frames and then hit "Backtrace stopped: previous frame identical to this frame (corrupt stack?)"; I've done something exactly like this in code in a CPU exception handler before, but couldn't remember other than `info symbol` how to do this in gdb. – leander Mar 08 '13 at 19:05
  • FWIW on 32-bit ARM devices: `x/256wa $sp` =) – leander Mar 08 '13 at 19:05
  • @leander Could you tell me what `x/256wa` is? I need it for 64-bit ARM. In general it would be helpful if you could explain what it is. – Sandeep Apr 17 '15 at 12:15
  • Per the answer, 'x'=examine memory location; it prints out a number of 'w'=words (in this case, 256), and interprets them as 'a'=addresses. There's more info in the GDB manual at https://sourceware.org/gdb/current/onlinedocs/gdb/Memory.html#Memory . – leander Apr 17 '15 at 23:18

Look at some of your other registers to see if one of them has the stack pointer cached in it. From there, you might be able to retrieve a stack. Also, if this is embedded, the stack is quite often placed at a very particular address. Using that, you can also sometimes get a decent stack. This all assumes that when you jumped to hyperspace, your program didn't puke all over memory along the way...

Michael Dorgan

If it's a stack overwrite, the values may well correspond to something recognisable from the program.

For example, I just found myself looking at this stack:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000000342d in ?? ()
#2  0x0000000000000000 in ?? ()

and 0x342d is 13357, which turned out to be a node-id when I grepped the application logs for it. That immediately helped narrow down candidate sites where the stack overwrite might have occurred.
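The conversion itself is simple arithmetic (0x342d = 3·4096 + 4·256 + 2·16 + 13 = 13357). A small C sketch, with a hypothetical helper name, for turning a frame value as printed by gdb into a decimal number you can grep for:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parses a frame value such as "0x000000000000342d". */
static unsigned long long frame_to_decimal(const char *hex)
{
    assert(hex != NULL);
    /* With base 16, strtoull accepts an optional "0x" prefix. */
    return strtoull(hex, NULL, 16);
}
```

Inside gdb, `print/d 0x342d` gives the same decimal value without leaving the debugger.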

Craig Ringer

Funny... we had the exact same thing going on with a driver in an old C app here. The top two stack trace values in hex were data bytes being read in off the port. I just happened to notice one because it was familiar.

user3053087