This is a somewhat general question. I have a segfault in a multithreaded program, and bt full on the core dump shows the following:

(gdb) bt full
#0  0x0000000000441540 in try_dequeue<std::shared_ptr<Frame> > (item=<synthetic pointer>, this=0xbe3c50) at /root/projects/active/user/include/third_party/concurrentqueue.h:1111
        nonEmptyCount = 0
        best = 0x0
        bestSize = 0
#1  ConsumerNice::listening_nice (this=0xbe3c40) at /root/projects/active/user/include/concurrency/consumer_nice.h:45
        frame = std::shared_ptr (empty) 0x0
#2  0x00000000004c0530 in execute_native_thread_routine ()
No symbol table info available.
#3  0x00007f3eb3f81e65 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f3ead70a88d in clone () from /lib64/libc.so.6
No symbol table info available.

So I looked at the source code. My code is as follows:

    void listening_nice() {
        while (true) {
            std::shared_ptr<Frame> frame;
            if (nice_queue.try_dequeue(frame)) {
                on_frame_nice(frame);
            }
        }
    }

and the relevant part of cameron314/concurrentqueue looks like this:

    bool try_dequeue(U& item)
    {
        // Instead of simply trying each producer in turn (which could cause needless contention on the first
        // producer), we score them heuristically.
        size_t nonEmptyCount = 0;
        ProducerBase* best = nullptr;
        size_t bestSize = 0;
        for (auto ptr = producerListTail.load(std::memory_order_acquire); nonEmptyCount < 3 && ptr != nullptr; ptr = ptr->next_prod()) {
            auto size = ptr->size_approx();
            if (size > 0) {
                if (size > bestSize) {
                    bestSize = size;
                    best = ptr;
                }
                ++nonEmptyCount;
            }
        }

It doesn't seem possible for this code to cause a segfault, so I am wondering: does bt always show the culprit thread? Or is there a chance that the segfault is caused by some other problem in some other thread, or even by the operating system?

Note that this program runs on 3 identically configured machines, but only one of them crashes, about once a day: on that one machine it runs for 3 hours straight and then crashes.

tesla1060
  • The backtrace will show the thread that actually made an illegal memory access. This doesn't mean it's the "culprit" - the thread might be running perfectly correct code, but encounter garbage data that was corrupted by another thread whose code is written incorrectly. With that said, I haven't validated or checked your claim "It doesn't seem possible to cause segfault" - it may not be true. I would not blame the operating system without good reason - unless you're using a niche or research-grade operating system, its correctness has been thoroughly tested by its userbase already. – nanofarad Jan 08 '23 at 01:18
  • By the way, use `thread apply all bt` to get stack traces for all threads. – Jesper Juhl Jan 08 '23 at 01:25
  • Just because this is where the program crashes or reports an error doesn't mean this is where the problem is. C++ does not work this way. The problem can be anywhere in your code, but after the bug occurs the program keeps running for a little bit before it finally crashes here. This is why stackoverflow.com's help center requires you to show a minimal reproducible example that everyone else can cut/paste ***exactly as shown***, then compile, run, and reproduce your problem. See How to Ask for more information. Until you do that, it is unlikely that anyone will be able to answer your question. – Sam Varshavchik Jan 08 '23 at 02:01
  • I've seen bugs in program start-up finally manifest visible behaviour months later. You'll get valuable hints from a stack trace, but might not get a slam-dunk answer. And you might not even recognize the significance of those valuable hints until you've tripped over more clues the hard way. But if programming were easy, I'd probably be making more money teaching philosophy, writing speeches for politicians, or even more likely starving in the street. – user4581301 Jan 08 '23 at 02:45
  • Note that stack corruption can corrupt the stack trace – Alan Birtles Jan 08 '23 at 03:27
  • @SamVarshavchik a reproducible example is next to impossible in this case: it only crashes once per day, and it only crashes on one of the three machines that are running the same code. – tesla1060 Jan 08 '23 at 07:38
  • *It doesn't seem possible to cause segfault* - It is very well possible. If `ptr` is not `nullptr`, but also NOT a valid pointer, you will get a segfault. – sklott Jan 08 '23 at 08:16
  • If it's "next to impossible" even for you to reproduce it, I have no idea how anyone else could help in any way... – Sam Varshavchik Jan 08 '23 at 14:58
  • You may want to look at the actual memory contents. – tevemadar Jan 08 '23 at 15:05
  • @tevemadar can that be done with a core dump? – tesla1060 Jan 09 '23 at 00:08
  • Honestly? I don't know the how. But by definition, a core dump contains snapshots of memory regions. If it doesn't contain enough, this older question may still be applicable: https://stackoverflow.com/questions/17965/how-to-generate-a-core-dump-in-linux-on-a-segmentation-fault – tevemadar Jan 09 '23 at 11:16
  • `bt` says the segfault was on line 1111. Which line is that? What are the values of the variables on that line? – Mark Plotnick Jan 10 '23 at 11:56
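
To make the point raised in the comments concrete (a non-null check does not prove a link is valid, and the code that crashes need not be the code with the bug), here is a minimal sketch. Everything in it is hypothetical and for illustration only: `Node`, `walk`, `make_list` and `buggy_writer` are not part of concurrentqueue, and array indices stand in for raw pointers so that the "invalid link" case can be detected safely instead of crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a producer node; not the concurrentqueue type.
struct Node {
    std::size_t size;  // plays the role of size_approx()
    long next;         // index of the next node; kNull plays the role of nullptr
};

constexpr long kNull = -1;

enum class WalkResult { Ok, InvalidLink };

// Walks the list the way the try_dequeue loop does, but validates every
// link before using it. A raw pointer offers no such check: dereferencing
// a non-null garbage pointer is where the reading thread would segfault.
WalkResult walk(const std::vector<Node>& pool, long head, std::size_t& total) {
    total = 0;
    std::size_t steps = 0;
    for (long i = head; i != kNull; i = pool[static_cast<std::size_t>(i)].next) {
        const bool in_range = i >= 0 && static_cast<std::size_t>(i) < pool.size();
        if (!in_range || ++steps > pool.size()) {
            return WalkResult::InvalidLink;  // non-null, but not a valid link (or a cycle)
        }
        total += pool[static_cast<std::size_t>(i)].size;
    }
    return WalkResult::Ok;
}

// Builds a well-formed three-node list: 0 -> 1 -> 2 -> kNull.
std::vector<Node> make_list() {
    return {{4, 1}, {0, 2}, {7, kNull}};
}

// Stands in for buggy code elsewhere in the program (the "culprit"):
// it overwrites a link with garbage and returns without any visible error.
void buggy_writer(std::vector<Node>& pool) {
    assert(pool.size() >= 2);
    pool[1].next = 999;  // neither kNull nor a valid index
}
```

Calling `walk` on the result of `make_list()` reports `WalkResult::Ok`; calling it again after `buggy_writer` reports `WalkResult::InvalidLink`, even though `walk` itself is unchanged and correct. With real pointers there is no range check to fall back on, so the failure would surface as a segfault inside the traversal, which is the frame a backtrace would show.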

0 Answers