Debugging segfault with no apparent cause in gdb?

Question

gdb was reporting that my C code was crashing somewhere in malloc(), so I linked my code with Electric Fence to pinpoint the actual source of the memory error. Now my code is segfaulting much earlier, but gdb's output is even more confusing:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x30026b00 (LWP 4003)]
0x10007c30 in simulated_status (axis=1, F=0x300e7fa8, B=0x1003a520, A=0x3013b000, p=0x1003b258, XS=0x3013b000)
    at ccp_gch.c:799

EDIT: The full backtrace:

(gdb) bt
#0  0x10007c30 in simulated_status (axis=1, F=0x300e7fa8, B=0x1003a520, A=0x3013b000, p=0x1003b258, XS=0x3013b000)
    at ccp_gch.c:799
#1  0x10007df8 in execute_QUERY (F=0x300e7fa8, B=0x1003a520, iData=0x7fb615c0) at ccp_gch.c:836
#2  0x10009680 in execute_DATA_cmd (P=0x300e7fa8, B=0x7fb615cc, R_type=0x7fb615d0, iData=0x7fb615c0)
    at ccp_gch.c:1581
#3  0x10015bd8 in do_volley (client=13) at session.c:76
#4  0x10015ef4 in do_dialogue (v=12, port=2007) at session.c:149
#5  0x10016350 in do_session (starting_port=2007, ports=1) at session.c:245
#6  0x100056e4 in main (argc=2, argv=0x7fb618f4) at main.c:271

The relevant code (slightly modified due to reasons):

796  static uint32_t simulated_status(
797      unsigned axis, struct foo *F, struct bar *B, struct Axis *A, BAZ *p, uint64_t *XS)
798  {
799      uint32_t result = A->status;
800      *XS = get_status(axis);
801      if (!some_function(p)) {
802          ...

The obvious thing to check would be whether A->status is valid memory, but it is. Removing the assignment pushes the segfault to line 800, and removing that assignment causes some other assignment in the if-block to segfault. It looks as though either accessing an argument passed to the function or writing to a local variable is what's causing the segfault, but everything points to valid memory according to gdb.

How am I to interpret this? I've never seen anything like this before, so any suggestions / pointers in the right direction would be appreciated. I'm using GNU gdb 6.8-debian, Electric Fence 2.1, and running on a PowerPC 405 (uname reports Linux powerpmac 2.6.30.3 #24 [...] ppc GNU/Linux).

Do you have another machine to test? Does your code segfault outside gdb? — Mauren, Mar 28 '14 at 20:15
You could be stack smashing earlier and not know it. I've had similar problems in the past where I was indexing data out-of-bounds, but a later line of code actually triggered the break (e.g. a `free()` call on valid memory). Can you try compiling your code with zero optimizations and the strong stack protector (`CCOPTS+=-O0 DEBUG=-g -fstack-protector-strong`)? This should cause your program to crash sooner than usual if you are stack smashing, and should point out where in your code you are trampling memory. — Cloud, Mar 28 '14 at 20:17
Offhand, what does Valgrind report? It would appear you're corrupting your heap, or you're truly accessing memory that may appear valid but in fact is already freed. And that call stack is nowhere *near* complete. What does the *rest* of it look like (all the way down to the initial thread proc)? And of course, you could be just plain out of stack space, consumed by an activation stack that has boatloads of automatics. I would think that would show up fairly quickly. — WhozCraig, Mar 28 '14 at 20:18
There's one other machine I could test on, but it's connected to real (and expensive!) hardware and I'd like to avoid running buggy code on it. My code does segfault outside of gdb, in the same place according to the log file. — ashastral, Mar 28 '14 at 20:20
@Fraxtil: Can you please share the output of the commands "shell cat /proc/PID/maps" and "info r" at the point where GDB reports the segfault? — Mantosh Kumar, Mar 28 '14 at 20:23
WhozCraig: the machine in question doesn't have Valgrind installed, but I can try building it if necessary. It's a bit of a pain to put software on it due to the setup. Also, I've added the full backtrace, although I don't think it provides much useful information. — ashastral, Mar 28 '14 at 20:24
@Fraxtil: I can see that your first argument is of type "unsigned axis". This looks a bit suspicious... it should be unsigned int or unsigned long, right? — Mantosh Kumar, Mar 28 '14 at 20:29
@tmp: [here's](http://pastebin.com/UPnDYJth) the output you requested, plus the backtrace since the addresses are likely different. — ashastral, Mar 28 '14 at 20:35
@tmp: the argument's type is just `unsigned`, which I believe is equivalent to `unsigned int`. — ashastral, Mar 28 '14 at 20:37
@Dogbert: it seems my version of gcc doesn't have `-fstack-protector-strong`, and due to my setup, rebuilding it from source is a very nontrivial operation. I did try compiling without optimization, but it didn't have any effect. — ashastral, Mar 28 '14 at 20:50
@Dogbert: that one works, but it's still not segfaulting any sooner. — ashastral, Mar 28 '14 at 21:00
Valgrind is the first thing I would be running in this situation. — James M, Mar 28 '14 at 21:12
I'm working on cross-compiling Valgrind. I'll report back once I've made progress. — ashastral, Mar 28 '14 at 21:21
Well, it turns out cross-compiling Valgrind for an embedded system is exactly as difficult as it sounds. Guess I'll try to work something else out. — ashastral, Mar 28 '14 at 23:51

score 0 · Answer 1 · answered Mar 30 '14 at 12:18

This is a guess, but your symptoms resemble what happens when the stack overflows: the first access to a local variable or spilled argument in a new frame faults even though every pointer involved is valid. The -fstack-protector suggestion in the comments is on the right track here. I'd recommend adding the -fstack-check option as well.
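To sanity-check the stack overflow theory, it can help to know how much stack the process is actually allowed. The following is a minimal sketch, assuming a POSIX system; `stack_soft_limit` and `print_stack_limit` are hypothetical helper names and are not part of the code in the question. It reads the soft limit with `getrlimit(RLIMIT_STACK, ...)`, which could then be compared against the total size of the automatic variables in the failing call chain.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft limit on the process stack size.
 * Returns 0 on success and stores the limit in *out; returns -1 on failure.
 * A stored value of RLIM_INFINITY means the stack size is unlimited. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: print the limit in a human-readable form. */
void print_stack_limit(void)
{
    rlim_t lim;

    if (stack_soft_limit(&lim) != 0) {
        perror("getrlimit");
        return;
    }
    if (lim == RLIM_INFINITY)
        puts("stack limit: unlimited");
    else
        printf("stack limit: %lu bytes\n", (unsigned long)lim);
}
```

Calling `print_stack_limit()` near the top of `main` would log the limit once at startup. If the limit turns out to be small relative to the locals declared along the path from `main` to `simulated_status`, that would make the overflow explanation more plausible, though it would not prove it.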

If the SEGV is caused by a write to the guard page protecting the stack, the output of info registers and info frame in gdb at the point of the crash would help confirm it.
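As a rough illustration of what to look for: on PowerPC the stack pointer is r1, and the stack's address range appears on the [stack] line of /proc/PID/maps. The sketch below is only a heuristic under those assumptions; `sp_near_stack_bottom` is a hypothetical name, and the page size of 4096 bytes is an assumption that should be checked on the target. It reports whether a stack pointer lies below the mapped stack or within one page of its low end, which is where a guard-page fault would be expected.

```c
#include <stdint.h>

/* Hypothetical helper: is the stack pointer at (or past) the low end of
 * the stack mapping?
 *
 *   sp        - stack pointer at the fault (r1 from "info registers")
 *   stack_low - lowest address of the [stack] line in /proc/PID/maps
 *   page_size - page size in bytes (assumed, e.g. 4096)
 *
 * Returns 1 if sp is below the mapping or inside its lowest page,
 * otherwise 0. */
int sp_near_stack_bottom(uintptr_t sp, uintptr_t stack_low, uintptr_t page_size)
{
    if (sp < stack_low)
        return 1;                         /* already below the mapped stack */
    return (sp - stack_low) < page_size;  /* inside the lowest mapped page */
}
```

For example, with made-up values, `sp_near_stack_bottom(0x7fb60ff0u, 0x7fb61000u, 4096u)` returns 1 because the stack pointer is below the mapping, while a stack pointer of `0x7fb70000u` with the same mapping returns 0. A result of 1 is a hint that the stack is exhausted; a result of 0 suggests looking elsewhere, such as earlier memory corruption.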
