How to debug a rare deadlock?

Question

I'm trying to debug a custom thread pool implementation that rarely deadlocks. So I cannot simply use a debugger like gdb, because I would have to launch the debugger about 100 times before hitting a deadlock.

Currently, I'm running the thread pool test in an infinite loop in a shell script, but that means I cannot inspect variables and so on. I'm trying to print data with std::cout, but that slows down the threads and reduces the chance of a deadlock, meaning that I can wait about an hour with my infinite loop before getting messages. Then the messages don't show the error, and I need more messages, which means waiting one more hour...

How can I efficiently debug the program so that it restarts over and over until it deadlocks? (Or should I open another question with all the code to get some help?)

Thank you in advance!

Bonus question: how do I check that everything goes fine with a std::condition_variable? You cannot really tell which threads are asleep or whether a race condition occurs on the wait condition.

Debugging deadlocks is hard. First of all, make sure that all mutexes and semaphores are unlocked in the same order they will be locked in sequence. `std::cout` is probably a bad tool, since it changes runtime behavior and timing. – πάντα ῥεῖ, Jan 04 '15 at 22:15
[Helgrind](http://valgrind.org/docs/manual/hg-manual.html) is probably going to help here. Otherwise you could run your program normally, get the deadlock, and then attach gdb to the deadlocked program (see [here](http://stackoverflow.com/questions/14370972/how-to-attach-a-process-in-gdb)) – sbabbi, Jan 04 '15 at 22:23
@πάνταῥεῖ Why must the unlocking be done in the same order? In most programs it will be done in reverse order of locking for convenience, but I cannot see any possibility of errors when you unlock in arbitrary order (deadlocks usually happen when you attempt to lock something, not when you unlock). – Maciej Piechotka, Jan 04 '15 at 22:43
@πάνταῥεῖ I'm still not sure why unlocking order is a problem: as long as the lock hierarchy is not violated during locking, it should not make any difference as far as I can tell (unless I made some error while interleaving in my head). – Maciej Piechotka, Jan 04 '15 at 22:50
@MaciejPiechotka Sorry, I've deleted my silly comment. What I mean is that different threads should use exactly the same order when trying to lock any synchronization primitives (and unlock them in a corresponding order, of course). – πάντα ῥεῖ, Jan 04 '15 at 22:55
@πάνταῥεῖ No, there are many valid use cases where the order might be different. For example, when you traverse a tree, you first lock the root, then lock a child, then release the root but keep the child lock. You narrow your locked space as you traverse down the tree, allowing more parallelism. – Lothar, Feb 03 '21 at 06:25
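
The locking-order point in the comments above can be sketched in code. This is a minimal illustration, not taken from the thread pool in the question: `Account`, `transfer` and `run_opposing_transfers` are hypothetical names. It assumes C++17, where `std::scoped_lock` acquires several mutexes with a deadlock-avoidance algorithm, so two threads that name the same pair of mutexes in opposite order should not deadlock on them.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical example type; stands in for any two objects with their own locks.
struct Account {
    std::mutex m;
    int balance = 0;
};

// Locks both accounts at once. std::scoped_lock (C++17) uses a
// deadlock-avoidance algorithm, so the argument order does not matter.
void transfer(Account& from, Account& to, int amount) {
    assert(amount >= 0);
    if (&from == &to) return;  // locking the same mutex twice would be undefined behavior
    std::scoped_lock lock(from.m, to.m);
    from.balance -= amount;
    to.balance += amount;
}

// Runs transfers in opposite directions on two threads and returns the
// combined balance, which must be unchanged if the locking is sound.
int run_opposing_transfers(int iterations) {
    Account a;
    Account b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < iterations; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < iterations; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Replacing the `std::scoped_lock` line with two separate `std::lock_guard`s (first `from.m`, then `to.m`) would reintroduce the classic lock-order inversion, which is the kind of rare deadlock the question describes.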

Maciej Piechotka · Accepted Answer · 2015-01-04T22:44:42.127

There are two basic ways:

1. Automate running the program under the debugger. `gdb program -ex 'run <args>' -ex 'quit'` runs the program under the debugger and then quits. If the program is still alive in one form or another (it segfaulted, or you interrupted it manually), you will be asked for confirmation.
2. Attach the debugger after reproducing the deadlock. For example, gdb can be run as `gdb <program> <pid>` to attach to a running program: just wait for the deadlock and attach then. This is especially useful when an attached debugger changes the timing and you can no longer reproduce the bug.

In this way you can just run it in a loop and wait for the result while you drink coffee. BTW, I find the second option easier.

Agree completely with attaching. That's what I always do & 99% of the time it tells me what I want. — Component 10, Jan 04 '15 at 22:42

score 6 · Answer 2 · edited Apr 13 '17 at 12:49

If this is some kind of homework, restarting again and again with more debug output is a reasonable approach.

If somebody pays money for every hour you wait, they might prefer to invest in software that supports replay-based debugging, that is, software that records everything a program does, every instruction, and allows you to replay it again and again, debugging back and forth. Thus, instead of adding more debug output, you record a session during which a deadlock happens, and then start debugging just before the deadlock happened. You can step back and forth as often as you want, until you finally find the culprit.

The software mentioned in the link actually supports Linux and multithreading.

answered Jan 04 '15 at 22:27 by Hans Klünder

It's a personal project, so it's between the two :) I didn't know about replay-based debugging, and it's wise advice. I want to accept this answer, but it doesn't precisely answer the question and I cannot accept two, so I give +1 instead. Thank you so much! – Lærne Jan 05 '15 at 14:59
There is also now a worthy open source implementation: Mozilla rr: https://stackoverflow.com/questions/27770896/how-to-debug-rare-deadlock/50073993#50073993 – Ciro Santilli OurBigBook.com Apr 28 '18 at 07:24

Ciro Santilli OurBigBook.com · Answer 3 · 2018-04-29T15:06:51.843

Mozilla rr open source replay based debugging

https://github.com/mozilla/rr

Hans mentioned replay based debugging, but there is a specific open source implementation that is worth mentioning: Mozilla rr.

First you do a record run, and then you can replay the exact same run as many times as you want, and observe it in GDB, and it preserves everything, including input / output and thread ordering.

The official website mentions:

rr's original motivation was to make debugging of intermittent failures easier

Furthermore, rr enables GDB reverse debugging commands such as reverse-next to go to the previous line, which makes it much easier to find the root cause of the problem.

Here is a minimal example of rr in action: How to go to the previous line in GDB?

Nice tool, I didn't know it. I'll try to use it in the future ! — Lærne, Apr 29 '18 at 14:33

score 2 · Answer 4 · edited May 23 '17 at 10:30

You can run your test case under GDB in a loop using the command shown in https://stackoverflow.com/a/8657833/341065: `gdb --eval-command=run --eval-command=quit --args ./a.out`.

I have used this myself: `(while gdb --eval-command=run --eval-command=quit --args ./thread_testU ; do echo . ; done)`.

Once it deadlocks and does not exit, you can interrupt it with CTRL+C to drop into the debugger.

answered Jan 04 '15 at 22:35 by wilx

Thanks for the explicit commands. I personally tried with `-batch` but hitting CTRL+C didn't give me the interactive session, obviously. What you gave is exactly what I wanted to do in the first place. Thank you. – Lærne Jan 05 '15 at 15:03
Any way to get notified if one of the runs takes too long? https://stackoverflow.com/questions/50066342/how-to-get-a-notification-if-a-given-process-runs-for-longer-than-a-specified-ti – Ciro Santilli OurBigBook.com Apr 28 '18 at 07:35
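
On the question of noticing a run that takes too long: one in-process option is a watchdog thread that fires a callback when the test has not finished within a deadline. The sketch below is an assumption-laden illustration, not something proposed in this thread: `Watchdog` and `disarm` are hypothetical names, and the time limit is whatever you consider "too long" for a single run. It also shows the predicate form of `std::condition_variable::wait_for`, which guards against spurious and missed wakeups.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: runs on_timeout on a background thread unless
// disarm() is called (or the object is destroyed) before `limit` elapses.
class Watchdog {
public:
    Watchdog(std::chrono::milliseconds limit, std::function<void()> on_timeout)
        : thread_([this, limit, cb = std::move(on_timeout)] {
              std::unique_lock<std::mutex> lock(m_);
              // The predicate overload returns false only if the deadline
              // passed while done_ was still false.
              if (!cv_.wait_for(lock, limit, [this] { return done_; })) {
                  lock.unlock();
                  cb();
              }
          }) {}

    ~Watchdog() {
        disarm();
        thread_.join();
    }

    Watchdog(const Watchdog&) = delete;
    Watchdog& operator=(const Watchdog&) = delete;

    // Marks the guarded work as finished; safe to call more than once.
    void disarm() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread thread_;  // declared last so the members above exist before it starts
};
```

A possible use is `Watchdog dog(std::chrono::seconds(30), [] { std::abort(); });` at the top of the test's `main` (with `<cstdlib>` included). Assuming the program runs under the gdb loop from this answer, the resulting SIGABRT should stop the process inside gdb, so the hung threads can be inspected with `thread apply all bt` without having to watch the terminal.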

deb0ch · Answer 5 · 2015-08-13T11:02:23.127

A quick and easy way to debug deadlocks is to keep some global variables that you update at the points you want to observe, and then print them in a signal handler. You can use SIGINT (sent when you interrupt with Ctrl+C) or SIGTERM (sent by default when you kill the program):

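#include <atomic>
#include <csignal>
#include <cstdlib>
#include <iostream>

std::atomic<int> dbg{0};  // updated at the points you want to observe

// Defined before use. Note: std::cout and std::exit are not
// async-signal-safe, so this is acceptable only as a quick debugging hack.
void dbg_sighandler(int)
{
  std::cout << dbg.load() << std::endl;
  std::exit(EXIT_FAILURE);
}

int multithreaded_function()
{
  std::signal(SIGINT, dbg_sighandler);
  ...
  dbg = someVar;
  ...
}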

That way you see the state of all your debug variables when you interrupt the program with Ctrl+C.

In addition you can run it in a shell while loop:

$> while [ $? -eq 0 ]
   do
   ./my_program
   done

which will run your program forever until it fails ($? is the exit status of your program and you exit with EXIT_FAILURE in your signal handler).

It worked well for me, especially for finding out how many threads passed before and after which locks.

It is rather crude, but you do not need any extra tools and it is quick to implement.
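
A variant of the same idea that stays within what a signal handler is allowed to do is sketched below. It is an illustration under assumptions, not the code used in this answer: `g_hits`, `checkpoint`, `to_decimal`, `dump_checkpoints` and `install_dump_handler` are hypothetical names, the number of checkpoints is arbitrary, and a POSIX system is assumed for `write` and `_exit`. Counters are lock-free atomics, and the handler formats numbers by hand instead of using `std::cout`.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstdlib>
#include <unistd.h>

// One counter per place of interest, e.g. "before lock A", "after lock A".
constexpr std::size_t kCheckpoints = 4;
std::atomic<unsigned long> g_hits[kCheckpoints]{};

// Call this from worker threads; a relaxed increment is cheap and does not
// add synchronization that could hide the deadlock.
inline void checkpoint(std::size_t id) {
    assert(id < kCheckpoints);
    g_hits[id].fetch_add(1, std::memory_order_relaxed);
}

// Writes `value` in decimal into buf (no terminator) and returns the number
// of characters written. Uses no library calls, so it is signal-safe.
std::size_t to_decimal(unsigned long value, char* buf, std::size_t cap) {
    char tmp[24];
    std::size_t n = 0;
    do {
        tmp[n++] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);
    const std::size_t len = n < cap ? n : cap;
    for (std::size_t i = 0; i < len; ++i) buf[i] = tmp[n - 1 - i];
    return len;
}

// Prints "index: count" lines to the given file descriptor using write(2).
void dump_checkpoints(int fd) {
    for (std::size_t i = 0; i < kCheckpoints; ++i) {
        char line[64];
        std::size_t n = to_decimal(i, line, 20);
        line[n++] = ':';
        line[n++] = ' ';
        n += to_decimal(g_hits[i].load(std::memory_order_relaxed), line + n, 20);
        line[n++] = '\n';
        const ssize_t ignored = write(fd, line, n);
        (void)ignored;
    }
}

extern "C" void on_sigint(int) {
    dump_checkpoints(STDERR_FILENO);
    _exit(EXIT_FAILURE);  // async-signal-safe, unlike std::exit
}

void install_dump_handler() { std::signal(SIGINT, on_sigint); }
```

Calling `install_dump_handler()` once at startup and `checkpoint(0)`, `checkpoint(1)`, and so on around each lock gives the same "how many threads got past which lock" picture on Ctrl+C, and the non-zero exit status still stops the shell loop above.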

Interesting way to do it, that has the advantage of not requiring to use anything but what we already have. — Lærne, Aug 12 '15 at 07:28
