I support a C++ application that has been developed over many years. Lately it has started to crash, producing core dumps that we don't know how to interpret. It runs on an appliance with Ubuntu 14.04.5.

When I load the core file in GDB, it reports: Program terminated with signal SIGABRT, Aborted.

I can inspect 230 threads, but they are all in wait() at the exact same address.

There is a thread with ID 1 that could in theory be the one responsible, but that thread is also in wait().

So I basically have two questions.

How does the ID numbering of the threads work? Is the thread with GDB ID 1 the last active thread, or is that an arbitrary index, so that the failure could be in any of the other threads?

How can all threads be in wait() when a SIGABRT is triggered? Shouldn't the instruction pointer be at the failing instruction when the OS decided to step in and halt the process? Or is it some sort of deadlock protection?

Any help much appreciated.

Backtrace of thread 1:

#0  0xf771dcd9 in ?? ()
#1  0xf74ad4ca in _int_free (av=0x38663364, p=<optimized out>,have_lock=-186161432) at malloc.c:3989
#2  0xf76b41ab in std::string::_Rep::_M_destroy(std::allocator<char> const&) () from /usr/lib32/libstdc++.so.6
#3  0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#4  0xf764f82f in operator delete(void*) () from /usr/lib32/libstdc++.so.6
#5  0x5685e8b4 in SlimStringMapper::~SlimStringMapper() ()
#6  0x567d6bc3 in destroy ()
#7  0x566a40b4 in HttpProxy::getLogonCredentials(HttpClient*, HttpServerTransaction*, std::string const&, std::string const&, std::string&, std::string&) ()
#8  0x566a5d04 in HttpProxy::add_authorization_header(HttpClient*, HttpServerTransaction*, Hosts::Host*) ()
#9  0x566af97c in HttpProxy::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#10 0x566d597e in callOnClientRequest(HttpClient*, HttpServerTransaction*, FastHttpRequest*) ()
#11 0x566d169f in GateKeeper::onClientRequest(HttpClient*, HttpServerTransaction*) ()
#12 0x566a2291 in HttpClientThread::run() ()
#13 0x5682e37c in wa_run_thread ()
#14 0xf76f6f72 in start_thread (arg=0xec65ab40) at pthread_create.c:312
#15 0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#16 0xec65ab40 in ?? ()

Another thread that should be in wait:

#0  0xf771dcd9 in ?? ()
#1  0x5682e37c in wa_run_thread ()
#2  0xf76f6f72 in start_thread (arg=0xf33bdb40) at pthread_create.c:312
#3  0xf75282ae in query_module () at ../sysdeps/unix/syscall-template.S:82
#4  0xf33bdb40 in ?? ()

Best regards Jon

3 Answers

How can all threads be in wait() when a SIGABRT is triggered?

Is wait the POSIX function, or something from the run-time environment? Are you looking at a higher-level backtrace?

Anyway, there is an easy explanation why this can happen: SIGABRT was sent to the process, and not generated by a thread in a synchronous fashion. Perhaps a coworker sent the signal to create the coredump, after observing the deadlock, to collect evidence for future analysis?
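The two delivery paths can be compared with a small experiment. This is a minimal sketch for Linux/POSIX, not code from the application in question: `terminating_signal` is a hypothetical helper that forks a child, lets it die from SIGABRT either synchronously (the child calls `abort()` itself) or asynchronously (the child blocks in `pause()` and the parent sends the signal with `kill()`), and returns the signal that terminated the child.

```cpp
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (illustration only): returns the signal number that
// terminated a forked child, or -1 if fork failed or the child was not
// killed by a signal.
//   synchronous == true : the child raises SIGABRT itself via abort().
//   synchronous == false: the child blocks in pause(); the parent sends
//                         SIGABRT from outside with kill().
int terminating_signal(bool synchronous) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        if (synchronous) {
            std::abort();        // SIGABRT raised by the running thread
        }
        for (;;) {
            pause();             // blocked until a signal arrives
        }
    }
    if (!synchronous) {
        kill(pid, SIGABRT);      // SIGABRT sent to the process from outside
    }
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Both calls report SIGABRT as the terminating signal, so the "Program terminated with signal SIGABRT" line alone does not say which path was taken. In the asynchronous case the child is still sitting in a blocking call when it dies, which is what a core dump of it would show.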

Florian Weimer
  • As I mentioned in an answer below, I am no longer 100% sure that this is wait(). All threads are at the exact same address, and as this was some time back I can no longer remember how I deduced that this was wait(), other than by analyzing other threads that should be in wait. Coworkers can be ruled out :) – Jon Salgerier Oct 23 '17 at 08:58

How does the id index of the threads work? Is thread with GDB ID 1 the last active thread?

When the program is running under GDB, GDB numbers threads as it discovers them, so thread 1 is always the main thread.

But when loading a core dump, GDB discovers threads in the order in which the kernel saved them. The kernels that I have seen always save the thread which caused program termination first, so usually loading the core into GDB immediately gets you to the crash point without the need to switch threads.

How can all threads be in wait() when a SIGABRT is triggered?

One possibility is that you are not analyzing the core correctly. In particular, you need exact copies of the shared libraries that were in use when the core was produced, and that's unlikely to be the case when the application runs on an "appliance" and you are analyzing the core on your development machine. See this answer.

Employed Russian
  • I realize now (in hindsight) that I might have been a bit quick to deduce that the position is wait(). This was a few weeks back, but I seem to remember seeing that in another core dump. In this case we just see ??, and judging by the backtrace I suppose this suggests freeing the same memory twice. All other threads are, however, at the exact same address, and those are threads waiting to be used. But again, this might have been too quick an assumption... – Jon Salgerier Oct 23 '17 at 08:53
  • Thank you for the reply, by the way :) I am analyzing the core file on a replica of the machine, so the libs should be the same. However, I had to install debug symbols for the shared libraries through apt-get, but that should be safe, right? – Jon Salgerier Oct 23 '17 at 08:55
  • Installing debugging information is safe as long as it does not trigger library upgrades along the way. – Florian Weimer Oct 23 '17 at 09:09

I just saw your question. First of all, my answer does not address your direct question; it offers some ways to handle this kind of situation. Multi-threading depends entirely on the hardware and operating system of the machine, especially memory and processors. More threads require more memory as well as more processor time slices. I doubt your appliance has more than 100 processors to run 230 threads concurrently at full performance. To avoid this situation, try the steps below, which may help you.

  1. Control the creation of threads. Limit the number of threads running concurrently.
  2. Increase the memory available to your application. (Check the compiler options that increase the application's memory at run time, or have the OS allocate enough memory.)
  3. Set the grid size and stack size of each thread properly. (This has to be calculated from what your application's threads do and is a bit complicated. Please read some documentation.)
  4. Handle synchronized blocks properly to avoid any deadlock.
  5. Where necessary, use condition variables and similar primitives.

As you said, most of your threads are in a wait state. That means they are waiting for a lock to be released so they can take their turn, which in turn means that one thread has already acquired the lock and is either still busy processing or possibly deadlocked.
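A thread sitting in a wait is not by itself proof that another thread holds a lock, though. The asker's comments describe the other threads as idle workers waiting to be used, and such workers typically park on a condition variable, so they all show up at the same address in a backtrace. A minimal C++11 sketch of that pattern, assuming a hypothetical helper `park_and_release` that is not taken from the application in question:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration: start `n` workers that all park on the same
// condition variable, then release them and return how many woke up.
int park_and_release(int n) {
    std::mutex m;
    std::condition_variable cv;
    bool release = false;   // guarded by m
    int woken = 0;          // guarded by m

    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            // The mutex is released while the thread sleeps here, so no
            // thread "holds the lock" while the pool is idle.
            cv.wait(lock, [&] { return release; });
            ++woken;
        });
    }

    {
        std::lock_guard<std::mutex> lock(m);
        release = true;
    }
    cv.notify_all();

    for (auto& t : workers) {
        t.join();
    }
    return woken;
}
```

While the workers are parked, a debugger would show every one of them blocked inside the same wait call, even though nothing is deadlocked.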

Abhijit Pritam Dutta