I have a problem that I have not been able to solve for a long time. Since I am out of ideas, I would be grateful for any suggestions.
The program is a physics simulation that works on a huge tree data structure with millions of dynamically allocated nodes, which are constructed / reorganized / destructed many times in parallel throughout the simulation, with a lot of pointers involved. Although this might sound very error-prone, I am almost sure that I am doing all this in a thread-safe manner. The program uses only standard libraries and classes plus Intel MKL (BLAS / LAPACK optimized for Intel CPUs) for matrix operations.
My code is parallelized using C++11 threads. The program runs fine on my desktop, my laptop and on two different Intel clusters using up to 8 threads. Only on one cluster does the code suffer from random crashes if I use more than 2 threads (it runs absolutely fine with one or two threads).
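To make the workload concrete, here is a minimal sketch of the allocation pattern described above: several C++11 threads that repeatedly build and destroy heap-allocated trees. It is not the actual simulation code. `Node`, `build`, `count` and `stress` are hypothetical names, and each thread owns its tree exclusively, so the sketch itself contains no shared mutable nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical node type; the real simulation's nodes carry far more state.
struct Node {
    std::vector<std::unique_ptr<Node>> children;
};

// Recursively allocate a complete tree with the given depth and fan-out.
std::unique_ptr<Node> build(int depth, int fanout) {
    assert(depth >= 0 && fanout >= 0);
    std::unique_ptr<Node> n(new Node);  // C++11 has no std::make_unique
    if (depth > 0)
        for (int i = 0; i < fanout; ++i)
            n->children.push_back(build(depth - 1, fanout));
    return n;
}

// Count the nodes in a subtree, including its root.
std::size_t count(const Node& n) {
    std::size_t c = 1;
    for (const auto& child : n.children) c += count(*child);
    return c;
}

// Each thread repeatedly builds, counts and frees a private tree.
// Returns, per thread, the total number of nodes it allocated.
std::vector<std::size_t> stress(unsigned num_threads, int rounds,
                                int depth, int fanout) {
    std::vector<std::size_t> totals(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&totals, t, rounds, depth, fanout] {
            for (int r = 0; r < rounds; ++r) {
                std::unique_ptr<Node> root = build(depth, fanout);
                totals[t] += count(*root);  // each thread writes only its own slot
            }   // root goes out of scope here and the whole tree is freed
        });
    for (auto& w : workers) w.join();
    return totals;
}
```

Calling something like `stress(4, 1000, 12, 3)` on the problematic cluster exercises the allocator from several threads without any of the simulation logic. If such a stripped-down test also crashed there, that would point away from my own code; if it ran cleanly, it would not prove much either way.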
The crash reports vary from case to case but are mostly connected to heap corruption (segmentation fault, corrupted double-linked list, malloc assertions, ...). Sometimes the program gets caught in an infinite loop as well. In very rare cases the data structure suddenly blows up and the program runs out of memory. Anyway, since the program runs fine on all other machines, I doubt the problem is in my source code. Since the crashes occur randomly, I found all backtrace information relatively useless.
The hardware of the problematic cluster is almost identical to that of another cluster on which the code runs fine with up to 8 threads (Intel Xeon E5-2630 CPUs). The libraries / compilers / OS are all relatively up to date. Note that other OpenMP-parallelized programs run fine on the same cluster.
(Linux version 3.11.10-21-default (geeko@buildhost) (gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux) ) #1 SMP Mon Jul 21 15:28:46 UTC 2014 (9a9565d))
I already tried the following approaches:
- adding a lot of assertions to ensure that all my pointers are handled correctly
- linking against tcmalloc instead of glibc malloc/free
- trying different compilers (g++, icpc, clang++) and compiler options (with / without compiler optimizations / debugging options)
- using the working binary from another machine with statically linked libraries too
- using OpenMP instead of C++ threads
- switching between serial / parallel MKL
- using other BLAS / LAPACK libraries
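As an illustration of the kind of pointer assertion mentioned in the list above, the following sketch checks that every child in a subtree points back to its parent. The `TreeNode` layout and the `links_consistent` helper are hypothetical and only stand in for the real node class.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node layout: a raw parent pointer plus raw child pointers.
struct TreeNode {
    TreeNode* parent = nullptr;
    std::vector<TreeNode*> children;
};

// Returns true if no child pointer in the subtree is null and every
// child's parent pointer refers back to the node that lists it.
bool links_consistent(const TreeNode* n) {
    if (n == nullptr) return false;
    for (const TreeNode* c : n->children) {
        if (c == nullptr || c->parent != n) return false;
        if (!links_consistent(c)) return false;
    }
    return true;
}
```

A check like `assert(links_consistent(root));` is only meaningful while no other thread is modifying the subtree, for example right after all worker threads have been joined. It can catch broken links, but not a dangling pointer to an already freed node.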
Using valgrind is out of the question, since the problem occurs randomly after anywhere from 10 minutes to several hours, and valgrind gives me a slowdown factor of around 50 - 100 (plus, valgrind does not allow real concurrency). Nevertheless, I ran the code in valgrind for several hours without problems.
Also, I cannot see any problem with the resource limits:
RLIMIT_AS: 18446744073709551615
RLIMIT_CORE : 18446744073709551615
RLIMIT_CPU: 18446744073709551615
RLIMIT_DATA: 18446744073709551615
RLIMIT_FSIZE: 18446744073709551615
RLIMIT_LOCKS: 18446744073709551615
RLIMIT_MEMLOCK: 18446744073709551615
RLIMIT_MSGQUEUE: 819200
RLIMIT_NICE: 0
RLIMIT_NOFILE: 1024
RLIMIT_NPROC: 2066856
RLIMIT_RSS: 18446744073709551615
RLIMIT_RTPRIO: 0
RLIMIT_RTTIME: 18446744073709551615
RLIMIT_SIGPENDING: 2066856
RLIMIT_STACK : 18446744073709551615
I found out that for some reason the stack size per thread seems to be only 2 MB, so I increased it using ulimit -s. Anyway, stack size shouldn't be the problem.
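For reference, this is a minimal sketch of how both values can be queried from inside the program on Linux, using the POSIX calls `getrlimit` and `pthread_attr_getstacksize`. The helper names `stack_rlimit`, `default_thread_stack_size` and `print_stack_limits` are made up for this example. As far as I understand glibc, the default thread stack size is derived from the soft RLIMIT_STACK at process start, and a fixed architecture default (2 MiB on x86-64) is used when that limit is unlimited, which would match the 2 MB I see here. I have not verified this on the cluster, so treat it as an assumption.

```cpp
#include <pthread.h>
#include <sys/resource.h>
#include <cstddef>
#include <cstdio>

// Soft RLIMIT_STACK of the current process, or 0 if the query fails.
// An unlimited stack is reported as RLIM_INFINITY.
rlim_t stack_rlimit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Stack size reported for a default-initialized pthread attribute
// object, i.e. what a newly created thread would get by default.
std::size_t default_thread_stack_size() {
    pthread_attr_t attr;
    std::size_t size = 0;
    if (pthread_attr_init(&attr) != 0) return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0) size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

// Print both values so they can be compared between machines.
void print_stack_limits() {
    const rlim_t lim = stack_rlimit();
    if (lim == RLIM_INFINITY)
        std::printf("RLIMIT_STACK (soft): unlimited\n");
    else
        std::printf("RLIMIT_STACK (soft): %llu bytes\n",
                    static_cast<unsigned long long>(lim));
    std::printf("default thread stack: %zu bytes\n",
                default_thread_stack_size());
}
```

Calling `print_stack_limits()` at startup on each cluster makes it easy to see whether the working and the failing machine actually differ here. Note that `std::thread` offers no portable way to choose the stack size, so the value can only be influenced from outside (for example via ulimit -s) or by creating the threads with pthreads directly.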
Also, the program should not have a problem allocating memory on the heap, since the available memory is more than sufficient.
Does anyone have an idea of what could be going wrong here, or where I should look? Am I perhaps missing some environment variables that I should check?
I think the fact that the error occurs only if I use more than two threads, and that the crash rate for more than two threads is independent of the number of threads, could be a hint.
Thanks in advance.