I have a program with 2 threads: one redraws the display (with ncurses), and the other does input/output processing on a serial port, printing some info as it goes.

I have found that at some points the second thread hangs for reasons unknown to me. How do I get to the bottom of the issue, given that:

  1. I cannot debug what happens in the second thread because libthread_db and libpthread do not match on my system, and gdb refuses to provide thread debugging.
  2. The thread that hangs does its processing with sequential calls to select and read on a non-blocking file descriptor.
  3. After dropping into gdb with Ctrl-C and resuming the program, the thread is unstuck; moreover, it then processes all the data that was stuck in the serial port's receive buffer.

Are there any tips or tricks that will help me get to the bottom of the issue and determine reason for hanging?

Update. Running with strace netted me these lines in trace:

waitpid(-1, 0xbfdcdfd0, 0)           = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
--- SIGCONT (Continued) @ 0 (0) ---

As far as I can tell, these correspond to the times when I saw a hang in the program, suspended it with C-z and looked at the trace file (where nothing new was written until the whole program had finished). After every resume the thread was no longer hung.

So, that means there is a 'rogue' waitpid call. I know for sure that it is not present in bare form anywhere in my code. A pity gdb fails to put a breakpoint on it - must be an issue of stripped symbols somewhere.

Srv19
  • A guess: since file descriptors are process-wide, `select()` returns saying input is ready, but the `fd` that's ready is not the one you `read()` from immediately after `select()` (because it "belongs" to another thread)? – P.P Oct 13 '16 at 15:13
  • If your debug information does not match, check that you have the debug package both installed and updated. I used to have that problem with openSUSE, and manually updating the debug package (through zypper, SUSE's package manager) solved the problem. – Jorge Bellon Oct 13 '16 at 15:16
  • See this question on how to debug libthread_db and libpthread mismatch http://stackoverflow.com/q/14364781/72178. – ks1322 Oct 13 '16 at 15:47
  • Use print statements to a debug file and flush after every print. When the program hangs again, open the file to see what the last message was. (And write meaningful information to the file, such as state, variables, and the function that does the print.) – Paul Ogilvie Oct 13 '16 at 15:48

1 Answer

Are there any tips or tricks that will help me get to the bottom of the issue and determine reason for hanging?

The obvious answer is to use strace on the hung thread to see what it's doing.

One common mistake is when you expect to read some number of bytes, and loop like this:

while (bytes_remaining > 0) {
  int n = read(..., bytes_remaining);
  if (n == -1) { 
    // handle read error ...
    break;
  }
  // save data we just got ...
  bytes_remaining -= n;
  // loop to read more data
}

The problem here is that read may return 0 on EOF, and you'll loop forever. In strace this will immediately be obvious.
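For comparison, here is a minimal sketch of a loop that does terminate on EOF. The helper name `read_fully` is hypothetical, and the `select` wait on `EAGAIN` is an assumption based on the question's non-blocking descriptor; it is not code from the original program.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to read `len` bytes from `fd` into `buf`.
 * Returns the number of bytes actually read (fewer than `len` if EOF
 * was reached first), or -1 on a real error. */
ssize_t read_fully(int fd, void *buf, size_t len)
{
    assert(buf != NULL);
    char *p = buf;
    size_t remaining = len;

    while (remaining > 0) {
        ssize_t n = read(fd, p, remaining);
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                /* Non-blocking fd with no data yet: wait until readable.
                 * Assumes fd < FD_SETSIZE. */
                fd_set rfds;
                FD_ZERO(&rfds);
                FD_SET(fd, &rfds);
                if (select(fd + 1, &rfds, NULL, NULL, NULL) == -1 &&
                    errno != EINTR)
                    return -1;
                continue;
            }
            return -1;                  /* real read error */
        }
        if (n == 0)
            break;                      /* EOF: stop instead of spinning */
        p += n;
        remaining -= (size_t)n;
    }
    return (ssize_t)(len - remaining);
}
```

A caller can compare the return value with `len` to tell a complete read from an early EOF.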

If that's not it, the other thing you can do (assuming Linux) is attach GDB to the hung thread instead of the process.
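On Linux every thread has its own kernel thread ID (TID), and the TIDs of a process are listed under /proc/<pid>/task. Passing a TID rather than the PID to `strace -p` traces only that thread, and `gdb -p` should likewise accept a TID. One way to learn which TID belongs to the serial-port thread is to have each thread log its own ID at startup. A minimal sketch follows; the helper names `current_tid` and `log_thread_id` are hypothetical, and the output should go to a file rather than the terminal that ncurses is drawing on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: kernel thread ID (TID) of the calling thread.
 * Uses the raw gettid system call, which is Linux-specific. */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Hypothetical helper: write "label: pid=... tid=..." to `out` and flush,
 * so the line survives even if the thread hangs right afterwards. */
void log_thread_id(FILE *out, const char *label)
{
    assert(out != NULL && label != NULL);
    fprintf(out, "%s: pid=%ld tid=%ld\n",
            label, (long)getpid(), (long)current_tid());
    fflush(out);
}
```

Calling `log_thread_id(debug_file, "serial")` as the first statement of the I/O thread's start routine (with `debug_file` being whatever log stream the program already has) records the number to pass to `strace -p` or `gdb -p` once the hang occurs.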

Employed Russian
  • How does one attach GDB to a thread? P.S. I have run with `strace`, adding to the question – Srv19 Oct 14 '16 at 13:14