
I have a problem that I have not been able to solve for a long time. Since I am out of ideas, I would be happy for any suggestions.

The program is a physics simulation that works on a huge tree data structure with millions of dynamically allocated nodes, which are constructed / reorganized / destructed many times in parallel throughout the simulation, with a lot of pointers involved. Although this might sound very error-prone, I am almost sure that I am doing all this in a thread-safe manner. The program uses only standard libs and classes plus Intel MKL (BLAS / LAPACK optimized for Intel CPUs) for matrix operations.

My code is parallelized using C++11 threads. The program runs fine on my desktop, my laptop and on two different Intel clusters using up to 8 threads. Only on one cluster does the code suffer from random crashes if I use more than 2 threads (it runs absolutely fine with one or two threads).

The crash reports vary from case to case but are mostly connected to heap corruption (segmentation fault, corrupted double-linked list, malloc assertions, ...). Sometimes the program gets caught in an infinite loop as well. In very rare cases the data structure suddenly blows up and the program runs out of memory. Anyway, since the program runs fine on all other machines, I doubt the problem is in my source code. Since the crashes occur randomly, I found all backtrace information relatively useless.

The hardware of the problematic cluster is almost identical to that of another cluster on which the code runs fine on up to 8 threads (Intel Xeon E5-2630 CPUs). The libs / compilers / OS are all relatively up to date. Note that other OpenMP-parallelized programs are running fine on the same cluster.
(Linux version 3.11.10-21-default (geeko@buildhost) (gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux) ) #1 SMP Mon Jul 21 15:28:46 UTC 2014 (9a9565d))

I already tried the following approaches:

  • adding a lot of assertions to ensure that all my pointers are handled correctly
  • linking against tc-malloc instead of glibc-malloc/free
  • trying different compilers (g++, icpc, clang++) and compiler options (with / without compiler optimizations / debugging options)
  • using the working binary from another machine with statically linked libraries to
  • using OpenMP instead of C++ threads
  • switching between serial / parallel MKL
  • using other BLAS / LAPACK libraries

Using valgrind is out of the question, since the problem occurs randomly after anywhere from 10 minutes to several hours, and valgrind gives me a slowdown factor of around 50 - 100 (plus, valgrind does not allow real concurrency). Nevertheless, I ran the code in valgrind for several hours without problems.

Also, I cannot see any problem with the resource limits:
RLIMIT_AS: 18446744073709551615
RLIMIT_CORE : 18446744073709551615
RLIMIT_CPU: 18446744073709551615
RLIMIT_DATA: 18446744073709551615
RLIMIT_FSIZE: 18446744073709551615
RLIMIT_LOCKS: 18446744073709551615
RLIMIT_MEMLOCK: 18446744073709551615
RLIMIT_MSGQUEUE: 819200
RLIMIT_NICE: 0
RLIMIT_NOFILE: 1024
RLIMIT_NPROC: 2066856
RLIMIT_RSS: 18446744073709551615
RLIMIT_RTPRIO: 0
RLIMIT_RTTIME: 18446744073709551615
RLIMIT_SIGPENDING: 2066856
RLIMIT_STACK : 18446744073709551615

I found out that for some reason the stack size per thread seems to be only 2 MB, so I increased it using ulimit -s. Anyway, stack size shouldn't be the problem.
Also, the program should not have a problem with allocatable memory on the heap, since the available memory is more than sufficient.
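To double-check the per-thread stack size mentioned above, a minimal sketch like the following could be used. It assumes Linux with glibc and POSIX threads; the function names (`main_stack_limit`, `default_thread_stack_size`, `print_stack_settings`) are made up for illustration. As far as I understand, `RLIMIT_STACK` governs the main thread, while threads created with default attributes (which is what `std::thread` typically uses) get the default pthread stack size, and glibc appears to fall back to a fixed default when the limit is unlimited at program start.

```cpp
#include <pthread.h>
#include <sys/resource.h>
#include <cstddef>
#include <cstdio>

// Soft RLIMIT_STACK of the process (RLIM_INFINITY if unlimited, 0 on error).
// This limit applies to the main thread's stack.
rlim_t main_stack_limit() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Stack size reported by a default-initialized pthread attribute object,
// i.e. what a thread created without explicit attributes would get.
// Returns 0 on error.
std::size_t default_thread_stack_size() {
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;
    std::size_t size = 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0) size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

// Hypothetical helper: print both values for comparison.
void print_stack_settings() {
    const rlim_t lim = main_stack_limit();
    if (lim == RLIM_INFINITY) {
        std::printf("RLIMIT_STACK (main thread): unlimited\n");
    } else {
        std::printf("RLIMIT_STACK (main thread): %llu bytes\n",
                    static_cast<unsigned long long>(lim));
    }
    std::printf("default thread stack size:  %llu bytes\n",
                static_cast<unsigned long long>(default_thread_stack_size()));
}
```

Calling `print_stack_settings()` at the start of `main` on each cluster (built with `-pthread`) would show whether the problematic machine really differs from the others in this respect.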

Does anyone have an idea of what could be going wrong here / where I should look? Maybe I am missing some environment variables I should check?
I think the fact that the error occurs only if I use more than two threads, and that the crash rate for more than two threads is independent of the number of threads, could be a hint.

Thanks in advance.

  • Do your threads have a shared resource which they forget to protect when allocating or releasing it? Or maybe you have a pointer that is shared between the threads and one thread deletes the pointer while the other keeps on using the pointer? – Some programmer dude May 07 '15 at 13:26
  • It sounds like you might be fighting some sort of undefined behavior problem. Have a look at [this](http://stackoverflow.com/questions/7237963/a-c-implementation-that-detects-undefined-behavior) for some ways to possibly catch some problems – NathanOliver May 07 '15 at 13:27
  • @Joachim: Yes my threads share resources. Since the program is running fine on other machines for weeks without any heap corruption, plus valgrind does not report any problems, I am relatively sure that this should not be the case. – Sheld0r May 07 '15 at 13:28
  • Well that's the problem with [undefined behavior](http://en.wikipedia.org/wiki/Undefined_behavior), it might *seem* to run fine on some system, or on certain times, but change one little thing and it all goes haywire. Do you have any warnings when building your program? Have you enabled extra warnings? – Some programmer dude May 07 '15 at 13:33
  • Maybe try `-fsanitize=thread`? – Baum mit Augen May 07 '15 at 13:41
  • @Sheld0r `Since the program is running fine on other machines...` Never, ever, use those words to suggest that the program is "ok", and it is only a problem with a certain machine. This is especially the case for C++. If the program is broken for one machine, it is broken. – PaulMcKenzie May 07 '15 at 13:41
  • You could try some static code analyzer tools (coccinelle, pc-lint..). Give the thread sanitizer a shot as well. Your bug is probably subtle, so I'd suggest having someone else (a colleague) look at your code. More eyes can only help! – dau_sama May 07 '15 at 13:43
  • As @JoachimPileborg already linked to the other question: have a look at the thread sanitizer of gcc/clang. You may also try *helgrind* from the valgrind suite, which is for catching concurrency problems. Sadly it is ultra slow (x100 slowdown at least) and I would go for the sanitizers if possible. – onqtam May 07 '15 at 13:44
  • @PaulMcKenzie: That is true indeed. – Sheld0r May 07 '15 at 13:49
  • @Joachim: I compile with -Wall , only warnings are [-Wunused-variable]. – Sheld0r May 07 '15 at 13:51
  • Add `-Wextra` and `-pedantic` as well, something might show up. As well as the sanitizing options of course. Yes it will slow down your program, but better a slow program for a little while and that helps you find your bugs, than a program that works unpredictably for a long time. – Some programmer dude May 07 '15 at 13:56
  • -fsanitize=thread seems to be a good idea, I get some data race warnings, I just have not yet figured out why :-) Thanks a lot so far. Hopefully analyzing the error messages will help me to find the bug. – Sheld0r May 07 '15 at 15:24
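
To illustrate the kind of race the comments above are hinting at, here is a minimal sketch. It is not taken from the simulation: `Node`, `run_workers`, `add_children_racy` and `add_children_locked` are hypothetical names, and the tree node is reduced to a vector of child ids. The racy variant lets several threads call `push_back` on the same vector without synchronization, which is undefined behavior and can show up as heap corruption on one machine while appearing to work on another. The locked variant protects the same operation with a per-node mutex.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a tree node (not the simulation's real type).
struct Node {
    std::vector<int> children;  // child ids
    std::mutex guard;           // protects `children`
};

// Starts num_threads workers; each calls insert(id) per_thread times
// with distinct ids, then all workers are joined.
template <typename Insert>
void run_workers(int num_threads, int per_thread, Insert insert) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([=] {
            for (int i = 0; i < per_thread; ++i) insert(t * per_thread + i);
        });
    }
    for (auto& w : workers) w.join();
}

// BROKEN on purpose: concurrent push_back on one vector is a data race.
// A reallocation in one thread can free memory another thread still uses.
void add_children_racy(Node& node, int num_threads, int per_thread) {
    run_workers(num_threads, per_thread,
                [&node](int id) { node.children.push_back(id); });
}

// Same work, but every modification of the node holds the node's mutex.
void add_children_locked(Node& node, int num_threads, int per_thread) {
    run_workers(num_threads, per_thread, [&node](int id) {
        std::lock_guard<std::mutex> lock(node.guard);
        node.children.push_back(id);
    });
}
```

Calling `add_children_racy` from `main` in a build made with something like `g++ -std=c++11 -g -O1 -fsanitize=thread -pthread` should make ThreadSanitizer report a data race on the vector (older GCC versions may additionally need `-fPIE -pie`), while `add_children_locked` should run clean. A coarse per-node mutex is only one possible fix; it is chosen here because it is the simplest to reason about, not because it is necessarily the fastest option for a tree with millions of nodes.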
