I am developing a computational geometry application in C++. It runs in parallel using threads and OpenMP. It takes geometric input (nodes, edges, etc.) and produces an output. This almost always works perfectly. However, in roughly 1% of runs I get a corrupted result. The application doesn't crash, but the output contains what look like random memory values. Even if I run it on the same data again, the second run is fine. I used valgrind and helgrind, but they didn't detect any related error, so I am running out of ideas for how to trace it. Is there another tool that detects thread errors better than helgrind? Or is there a way to reproduce such a problem and record the exact state that led to the bug?

Cœur
Zovrix
  • Welcome to Stack Overflow. Please take the time to read [The Tour](http://stackoverflow.com/tour) and refer to the [Help Center](http://stackoverflow.com/help/asking) for what and how you can ask here. – πάντα ῥεῖ May 20 '17 at 10:58
  • Intermittent bugs 101: log and note everything. Change the inputs and the number of threads; log and note everything. Simplify, comment out calls, etc., even if the results will then obviously be wrong, and look for consistency (even if that 'consistency' is that it consistently crashes, which is a very good thing :). Log and note everything. Eventually, you will find the bug. I'm afraid there is no substitute for hard work with the debugger and logger (and experience :). – ThingyWotsit May 20 '17 at 11:08
  • If it makes you feel any better (and it won't :), my record for eradication of an intermittent bug is six months. The bug only manifested when a) more than one of one particular type of peripheral was logged in AND b) one particular update was triggered by more than one of the peripherals AND c) system shutdown was uncontrolled (e.g. power failure). – ThingyWotsit May 20 '17 at 11:14
  • Thank you very much for the advice. I thought that I log nearly everything but you just reminded me that there are still countless things left to be logged. The thing is I don't know how to stress the application to actually force thread errors, so I can easily see this bad result. It's the first time I have a bug that is so rare, and even if I manage to reproduce it, valgrind and memory sanitizers say that everything is fine. – Zovrix May 20 '17 at 11:16
  • @Zovrix Also, logging might change the behavior significantly, especially when dealing with race conditions and such. – πάντα ῥεῖ May 20 '17 at 11:19
  • Try [ThreadSanitizer](https://clang.llvm.org/docs/ThreadSanitizer.html), [AddressSanitizer](https://clang.llvm.org/docs/AddressSanitizer.html), [UndefinedBehaviorSanitizer](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) & [MemorySanitizer](https://clang.llvm.org/docs/MemorySanitizer.html). They are *quite* effective at finding subtle (and not so subtle) bugs. – Jesper Juhl May 20 '17 at 11:58
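
To illustrate the kind of defect these tools are meant to catch, here is a minimal sketch of a common OpenMP data race and one way to fix it. The function names and the edge-length computation are hypothetical and are not taken from the asker's code. The racy variant is shown for comparison only; calling it with OpenMP enabled is undefined behavior.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical example, not the asker's code.
// RACY: `scratch` is declared outside the loop, so with OpenMP enabled it is
// shared by all threads. One thread can overwrite it between another
// thread's write and read, which leaves wrong values in `out`.
std::vector<double> edge_lengths_racy(const std::vector<double>& dx,
                                      const std::vector<double>& dy) {
    assert(dx.size() == dy.size());
    std::vector<double> out(dx.size());
    const long n = static_cast<long>(dx.size());
    double scratch = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        scratch = dx[i] * dx[i] + dy[i] * dy[i];  // write to shared variable
        out[i] = std::sqrt(scratch);              // may read another thread's value
    }
    return out;
}

// FIXED: the temporary lives inside the loop body, so each iteration
// (and therefore each thread) has its own copy.
std::vector<double> edge_lengths_fixed(const std::vector<double>& dx,
                                       const std::vector<double>& dy) {
    assert(dx.size() == dy.size());
    std::vector<double> out(dx.size());
    const long n = static_cast<long>(dx.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const double squared = dx[i] * dx[i] + dy[i] * dy[i];
        out[i] = std::sqrt(squared);
    }
    return out;
}
```

With Clang, a build along the lines of `clang++ -g -fopenmp -fsanitize=thread file.cpp` enables ThreadSanitizer. Reports on OpenMP code may include false positives if the OpenMP runtime itself was not built with ThreadSanitizer support, so treat each report as a lead to verify rather than as proof.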

1 Answer

Disclaimer: I have not used the approach below with OpenMP, but based on what I just looked up, it seems to be possible.


I had a similar bug that I needed to reproduce in GDB. This post helped me run the application repeatedly until a segmentation fault occurred.

We could adapt that answer to your question by adding a conditional breakpoint that is hit when the output value is not as expected.

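    # Keep GDB from pausing on long output
    set pagination off
    # Whenever the program reaches exit, start it again
    break exit
    commands
    run
    end
    # Stop when the result is bad; file.cpp:123 and some_condition_holds are placeholders
    break file.cpp:123 if some_condition_holds
    # Start the first run
    run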

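Now, if you run the above in GDB, the program is restarted again and again until the bad result occurs (some_condition_holds is true). GDB then stops in the thread that hit the breakpoint. To inspect the other threads, list them and switch between them with the thread commands (the inferior commands switch between processes, not threads):

    info threads
    thread thread_num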
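
One way to make some_condition_holds concrete is to compute a validity flag in the program itself and break on it. The sketch below is only an illustration: `output_is_sane`, `check_output`, and the bounds `lo` and `hi` are hypothetical names, and what counts as a valid value depends on your data.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sanity check: true when every value is finite and lies in
// [lo, hi]. The bounds are an assumption about what "valid" means here.
bool output_is_sane(const std::vector<double>& values, double lo, double hi) {
    assert(lo <= hi);
    for (double v : values) {
        if (!std::isfinite(v) || v < lo || v > hi) {
            return false;
        }
    }
    return true;
}

// Hypothetical wrapper, meant to be called after the parallel section has
// finished. The result is stored in a local so that a GDB breakpoint
// condition can test it.
bool check_output(const std::vector<double>& values, double lo, double hi) {
    const bool sane = output_is_sane(values, lo, hi);
    return sane;  // breakpoint target, e.g. break file.cpp:NNN if !sane
}
```

Build with `-g -O0` so that the local `sane` is not optimized away, and replace `NNN` with the actual line number of the `return sane;` statement in your file.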