I am trying to find the cause of a segfault and have narrowed it to a PLT stub using gdb's btrace. The segfault occurs on the indirect jump that the PLT stub makes through its GOT entry, which I interpret to mean that the PLT (or the GOT entry it reads) became corrupted during execution. Based on the analysis presented below, is this interpretation correct? What are likely culprits for corruption of the PLT? Stack overflow? I believe that installing a watchpoint on the GOT address could be helpful in this instance. Would `watch -l 0x55555562f048` be the correct approach? Other ideas for debugging are welcome.


For context, the segfault occurs during a call to strlen in function foo:

int foo(char * path, ...) {
    ...
    if (strlen(path) >= PATH_MAX) {

The corresponding lines of assembly are:

0x58114 <foo+212>    cmpq   $0x0,-0x4c8(%rbp)
0x5811c <foo+220>    jne    0x5812a <foo+234>
0x5811e <foo+222>    lea    0xb96fb(%rip),%rdi        # 0x111820
0x58125 <foo+229>    callq  0x376c0 <__ubsan_handle_nonnull_arg@plt>
0x5812a <foo+234>    mov    -0x4c8(%rbp),%rax
0x58131 <foo+241>    mov    %rax,%rdi
0x58134 <foo+244>    callq  0x37090 <strlen@plt>

First, path is compared to NULL (cmpq $0x0,-0x4c8(%rbp)), a check that I believe is added solely by the ubsan instrumentation. When path is non-NULL, the jne is taken, the call to the ubsan handler is skipped, and execution continues at <foo+234>, where the code sets up the strlen call by moving path into rax and then rdi, and finally calls strlen@plt. Running record btrace in gdb just before this point produced the following instruction history:

(gdb) record btrace
(gdb) c
136        0x00005555555ac114 <foo+212>:        cmpq   $0x0,-0x4c8(%rbp)
137        0x00005555555ac11c <foo+220>:        jne    0x5555555ac12a <foo+234>
138        0x00005555555ac12a <foo+234>:        mov    -0x4c8(%rbp),%rax
139        0x00005555555ac131 <foo+241>:        mov    %rax,%rdi
140        0x00005555555ac134 <foo+244>:        callq  0x55555558b090 <strlen@plt>
141        0x000055555558b090 <strlen@plt+0>:   jmpq   *0xa3fb2(%rip)  # 0x55555562f048 <strlen@got.plt>

Here, we see that path was not null, so the program jumped to <foo+234> and set up the strlen call (mov, mov, callq <strlen@plt>). The final instruction executed was jmpq *0xa3fb2(%rip), an indirect jump through the GOT entry for strlen (strlen@got.plt): it loads the pointer stored at 0x55555562f048 and jumps to that address. At that point the program crashed and gdb lost context (reporting that it Cannot find bounds of current function).
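
The GOT slot address that gdb prints in its comment can be reproduced by hand, which is useful when deciding what address to examine or watch. A RIP-relative operand is resolved against the address of the next instruction, so the instruction's own length has to be added to the displacement. The sketch below is only an illustration: the helper name `rip_relative_target` is made up for this example, and the lengths used (6 bytes for `jmpq *disp32(%rip)`, 7 bytes for `lea disp32(%rip),%rdi`) are the usual encodings, consistent with the addresses in the listings above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of gdb or any library): compute the address
 * referenced by a RIP-relative operand. x86-64 adds the signed 32-bit
 * displacement to the address of the NEXT instruction, i.e. the
 * instruction's own address plus its encoded length.
 */
uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                             int32_t disp)
{
    /* An x86-64 instruction is between 1 and 15 bytes long. */
    assert(insn_len >= 1 && insn_len <= 15);
    /* Sign-extend the displacement before adding it. */
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

With the values from the trace, `rip_relative_target(0x55555558b090, 6, 0xa3fb2)` gives 0x55555562f048, the `strlen@got.plt` slot. The same rule applied to the `lea` at 0x5811e (7 bytes, displacement 0xb96fb) gives 0x111820, matching the disassembly comment. In gdb, `$rip` still points at the `jmpq` itself before it executes, so the slot is at `$rip + 0xa3fb2 + 6`.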

user001
  • "What are likely culprits for corruption of the PLT". Bugs in your code. But it's impossible to say more without actually seeing your code. By all means present your attempted debugging and analysis but you also need to show the code. Your analysis could be right but could also be totally wrong. But we can't verify either way without a [minimal verifiable example](https://stackoverflow.com/help/minimal-reproducible-example). – kaylum Jan 11 '20 at 21:56
  • @kaylum: It's certainly a bug, but I was wondering if particular types of bugs would be more likely to be associated. Unfortunately, it's simply too large a code base to be able to produce a minimal working example, though I agree that that would be of great benefit in finding the source. – user001 Jan 11 '20 at 21:59
  • You are unlikely to get any answer that will help you get to the root cause. The best you'll get are general/vague answers such as out of bounds accesses, accessing freed memory, stack overflow, etc., all of which are the same for any memory corruption. So unless you can provide an example of the code there isn't much hope of getting an answer. Having said that, have you tried using tools such as [valgrind](http://valgrind.org) to help you find the problem? – kaylum Jan 11 '20 at 22:05
  • @kaylum: Yes, unfortunately both `valgrind` and `AddressSanitizer` have proven to be unhelpful in this case. Upon failure, `valgrind` reports a `Jump to the invalid address stated on the next line` and `AddressSanitizer` reports `SEGV on unknown address`, but the `btrace` analysis has so far been more informative. I'm hoping people might have ideas on debugging strategies like particular registers or addresses to watch. – user001 Jan 11 '20 at 22:11
  • `jmpq *0xa3fb2(%rip)` reads a pointer at the address `rip+0xa3fb2` and _then_ jumps to this pointer it read. In GDB, immediately before executing that `jmpq`, you should `x/gx $rip+0xa3fb2` and check whether it is in fact a pointer to anything sane. – Iwillnotexist Idonotexist Jan 11 '20 at 22:21
  • @IwillnotexistIdonotexist: Thanks, I will try that. This function gets executed many times and only fails sometimes. Is it possible to watch `rip+0xa3fb2` for an invalid address in `gdb` (i.e., check `*[rip+0xa3fb2]` each time the program counter gets just beyond the preceding instruction (`callq `))? (This function runs many times, and fails randomly, so it would be nice to trap it when it's going to fail.) – user001 Jan 11 '20 at 22:27
  • There are potentially many GOTs in a program; which one is faulty I don't know. If it is always the same function segfaulting, only one GOT needs to be watched; for that you would breakpoint the function, reach this `strlen` call, step to the `jmpq`, then `awatch ADDR` where `ADDR` is the source of the pointer read by `jmpq` (this can change from program execution to program execution!). Another thing to try is to recompile all your code using the linker flag `-z relro`, which marks relocations as read-only once they've been made. If anything is corrupting them, it will crash immediately. – Iwillnotexist Idonotexist Jan 11 '20 at 22:32
  • @IwillnotexistIdonotexist: Thanks, those are great ideas. I was wondering if these tables could be made read-only after fixing up, but was not aware of the `ld` option you mentioned (`-z relro`). With this, it should be possible to find the instruction that tries to write to the GOT once it has been marked read-only? – user001 Jan 11 '20 at 22:42
  • @user001 Correct, it should, since the relocations are set once at startup and then frozen by the dynamic loader by making that area of memory read-only. – Iwillnotexist Idonotexist Jan 11 '20 at 22:44
  • @IwillnotexistIdonotexist: The program still segfaults with `gcc -Wl,-z,relro`. Perhaps it means that the initial fixing up of the jump table is the problem? Today, the strlen `jmpq` pointer is read from `$rip + 0x7af5a + 0x6`, examination of which shows `0x5555555d10a0 : 0x0000000000002146`. I tried setting up an `awatch -l ` using as addresses both the source of the pointer (`0x5555555d10a0`) and what it points to (`0x0000000000002146`), but in either case, `gdb` warns `Cannot watch constant value.` Is this because the memory is read-only (`-z relro`)? – user001 Jan 13 '20 at 01:10
  • 0x2146 is not a valid pointer of any type. It looks like an offset. And you may have to use `awatch *0xADDR` to tell watch to view the memory at that address. – Iwillnotexist Idonotexist Jan 13 '20 at 01:28
  • @IwillnotexistIdonotexist: Yes, you are right about the invalidity of the pointer -- the program segfaulted at that very instruction. Thanks for pointing out my error in using `awatch`. – user001 Jan 13 '20 at 01:33
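
The "is it a pointer to anything sane" check from the comments can also be done from inside the process, for example in a debug build that validates a suspect pointer before it is used. The following is a minimal sketch under stated assumptions: it is Linux-specific, it relies on the `/proc/self/maps` line format (`start-end perms ...`), and `is_executable_address` is a hypothetical name introduced here, not an existing API.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: return 1 if addr lies inside a mapping of the calling
 * process that has execute permission, 0 if it does not, and -1 if
 * /proc/self/maps cannot be opened. Linux-specific.
 */
int is_executable_address(uintptr_t addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    char line[1024];
    int found = 0;
    while (fgets(line, sizeof line, maps) != NULL) {
        uint64_t start, end;
        char perms[5];
        /* Each mapping line begins with "start-end perms", e.g. "...-... r-xp". */
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s",
                   &start, &end, perms) != 3)
            continue;
        /* The third permission character is 'x' for executable mappings. */
        if (addr >= start && addr < end && perms[2] == 'x') {
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

A value such as 0x2146 should fail this check on a typical Linux system, since nothing is mapped that low, whereas the address of a real function should pass. Separately, `-z relro` on its own generally leaves `.got.plt` writable when lazy binding is in use; combining it with `-z now` (for example `-Wl,-z,relro,-z,now`) is what makes the PLT's GOT slots read-only after startup, so a stray write would then fault at the writing instruction.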
