2

I have a program where I need to sort a large number of large numeric distributions. To reduce the time it takes to do this, I am trying to multi-thread this.

I wrote a small, simple abstraction of my program to try and isolate the issue. I believe I am encountering a stack overflow, or hitting the operating system's stack limit because my test program mirrors the segmentation fault issue when:

  • The distributions are all the same value (meaning qsort will run like crap)
  • Threading is enabled.

meow

#include <boost/thread/thread.hpp>
#include <vector>
#include <stdlib.h> // for rand()

void swapvals(double *distribution, const size_t &d1, const size_t &d2)
{
    double temp = 0;
    temp = distribution[d2];
    distribution[d2] = distribution[d1];
    distribution[d1] = temp;
    //std::swap(distribution[d1], distribution[d2]);

}

size_t partition(double *distribution,  size_t left, size_t right)
{
        const double pivot = distribution[right];

        while (left < right) {

                while ((left < right) && distribution[left] <= pivot)
                        left++;

                while ((left < right) && distribution[right] > pivot)
                        right--;

                if (left < right)
                {
                        swapvals(distribution, left, right);
                }
        }
        return right;
}

void quickSort(double *distribution, const size_t left, const size_t right)
{
        if (left >= right) {
                return;
        }
        size_t part = partition(distribution, left, right);
        quickSort(distribution, left, part - 1);
        quickSort(distribution, part + 1, right);
}
void processDistribution(double *distributions, const size_t distribution_size)
{

       std::clog << "beginning qsorting." << std::endl;
       quickSort(distributions, 0, distribution_size - 1);
       std::clog << "done qsorting." << std::endl;

}

int main(int argc, char* argv[])
{
    size_t distribution_size = 65000;
    size_t num_distributions = 10;

    std::vector<double *> distributions;

    // Create num_distributions distributions.
    for (int i = 0; i < num_distributions; i++)
    {
        double * new_dist = new double[distribution_size];
        for (int k = 0; k < distribution_size; k++)
        {
            // Works when I have actual numbers in the distributions.
            // Seg faults when all the numbers are the same.
            new_dist[k] =1;
            //new_dist[k] = rand() % 1000 + 1; // uncomment this, and it works.
        }

        distributions.push_back(new_dist);
    }

    // Submit each distribution to a quicksort thread.
    boost::thread_group threads;
    for (std::vector<double *>::const_iterator it=distributions.begin(); it != distributions.end(); ++it)
    {
         // It works when I run processDistribution directly. Segfaults when I run it via threads.
         //processDistribution(*it, distribution_size);
         threads.create_thread(boost::bind(&processDistribution, *it, distribution_size)); 
    }
    threads.join_all();

    // Show the results of the sort for all the distributions.
    for (std::vector<double *>::const_iterator it=distributions.begin(); it != distributions.end(); ++it)
    {
        for (size_t i = 0; i < distribution_size; i++)
        {
            // print first and last 20 results.
            if (i < 20 || i > (distribution_size - 20))
                std::cout << (*it)[i] << ",";
        }
        std::cout << std::endl;
    }

}

GDB analysis of the core file yields:

Error in re-setting breakpoint -1: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -1: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -2: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -3: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Core was generated by `testthreads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18

warning: Source file is more recent than executable.
18
(gdb) bt 7
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18
#1  0x0000000100005834 in quickSort (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:42
#2  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63627) at testthreads.cpp:43
#3  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63628) at testthreads.cpp:43
#4  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63629) at testthreads.cpp:43
#5  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63630) at testthreads.cpp:43
#6  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63631) at testthreads.cpp:43
(More stack frames follow...)
(gdb) frame 0
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18
18
(gdb) info locals
pivot = 1
(gdb) info args
distribution = 0x1101d1430
left = 0
right = 63626
(gdb)

Also, my actual program deals with many more threads and distributions. And the GDB inspections there often show far more bizarre stack traces that look like memory corruption (notice how swapVals is called with d1 = 12119, but inside the partition stack frame it's coming through as 4568618016):

(gdb) bt 3
#0  0x00000001002aa0b8 in ScenRankReplacer<double>::swapvals (this=0xfffffffffffdfc8, distribution=..., d1=@0x1104c8178: 4568618016, d2=@0x1104c8140: 4568416720, ranking_values=0x1104c81d0,
    r1=@0x1104c8170: 1152921504606838728, r2=@0x1002a16c8: 6917529029728344952) at ScenRankReplacer.h:96
#1  0x00000001002a7120 in ScenRankReplacer<double>::partition (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:122
#2  0x00000001002a16c8 in ScenRankReplacer<double>::quickSort (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:91
(More stack frames follow...)
(gdb) frame 1
#1  0x00000001002a7120 in ScenRankReplacer<double>::partition (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:122
122             swapvals(distribution, mid, left, ranking_values, mid - 1, left - 1);
(gdb) p mid
$1 = 12119
(gdb) p left
$2 = 1

So...my questions:

  1. Am I correct? Am I hitting a stack limit?
  2. How on earth do I ascertain that this is the case (other than the deduction i've done above)? Is there an easy way to detect these? a GDB clue or something?
  3. Why does the threading matter? Do all the threads share the same stack limit?
  4. Most important: How do I get this working?! Is a recursive quicksort on massive datasets just not feasable?

The error occurs with compilation level O2. Thread model: aix gcc version 4.8.3 (GCC)

user1848244
  • 425
  • 3
  • 12
  • Ummm... 1st thing while debugging multi-threading software I do check is: does it work on a single thread? Is it really so hard to determine for yourself whether you hit the stack limit? You should be able to get the stack size per recursion and recursion level for a given data set. Please bear in mind you are asking a debugging question and you are not very specific to be honest. In your "My questions" there are 9 question sentences... There is at least 5 follow up questions I could pose, this is not feasible for Q&A format. At least in my humble opinion. – luk32 Sep 14 '15 at 23:51
  • I mention that when no threads are used it works. (It fails when I use 1 or more threads). How would you get the stack size and number of recursions across all the threads? I've put a 'debugging' tag into the question. – user1848244 Sep 15 '15 at 00:02
  • I would suggest to add more logs to your program. If stack overflow actually happens - your code will crash right on calling the function or {} block that causes it. – Dmitrii Z. Sep 15 '15 at 00:08
  • It's passing a pointer by value to the function. The data being pointed at stays in scope until the program ends so it wouldn't be that. – user1848244 Sep 15 '15 at 00:48
  • After some investigations i found out that u can't (http://stackoverflow.com/questions/15435087/how-to-set-pthread-max-stacksize) set pthread's stacksize bigger than PTHREAD_STACK_MIN + 0xFFFF and this value is not enough for your use case (you get Bus error: 10 on #5422 recursive call of quickSort with best optimizes). The best solution i see (without complete algo modifying) is creating new processes instead of threads. Processes' stack can grow up dynamically (http://unix.stackexchange.com/questions/63742/what-is-automatic-stack-expansion) – Dmitrii Z. Sep 15 '15 at 01:45

2 Answers2

1

This looks like it could be stack space related. Threading matters because, while all threads have their own stacks, those stacks all share the same memory pool. Stacks will normally grow as needed until they run into memory that is already used, which in this case would likely be the stack from another thread. A single threaded program won't have that issue and can grow it's stack larger. (Also with multiple threads you're doing multiple sorts at the same time which would require more stack space.)

One way to fix this is to remove the recursion and use some loops and local storage to replace it. Something like this (uncompiled or tested) code:

void quickSort(double *distribution, size_t left, size_t right) {
    std::vector<std::pair<size_t, size_t>> ranges;
    for (;;) {
        for (;;) {
            if (left <= right)
                break;
            size_t part = partition(distribution, left, right);

            // save range for later to replace the second recursive call
            ranges.push_back(std::make_pair(part + 1, right));

            // set right == part - 1, then loop, to replace the first recursive call
            right = part - 1;
        }
        if (ranges.empty())
            break;

        // Take top off of ranges for the next loop, replacing the second recursive call
        left = ranges.back().first;
        right = ranges.back().second;
        ranges.pop_back();
    }
}
1201ProgramAlarm
  • 32,384
  • 7
  • 42
  • 56
0

so after a bit more hair pulling I've figured out the answers to my questions.

  1. Am I correct? Am I hitting a stack limit? How on earth do I ascertain that this is the case (other than the deduction i've done above)? AND

  2. Is there an easy way to detect these? a GDB clue or something?

A: Yes. The program was overflowing the stack. I could not ascertain a direct way to determine that this was the case on AIX. However, when I put the code into visual studio 2015 on Windows and ran it, the program crashed with a clear "Stack Overflow" error.

I was hoping there was a way to get a clear 'Stack Overflow' error on AIX, similar to the VS result. I could not find a way. Even compiling using -fstack-check didn't give me a Storage Error :(

  1. Why does the threading matter? Do all the threads share the same stack limit?

A: The default stack size for threads on AIX is surprisingly small!

From this IBM developerworks blog post:

For a 32-bit compiled application on AIX, the default pthread stacksize is 96 KB; and for a 64-bit compiled application on AIX,

  1. Most important: How do I get this working?! Is a recursive quicksort on massive datasets just not feasable?

I can only conceive of two ways: A1: The first would be to increase the stack size.

From the IBM Debugging Guidelines for Threads The minimum stack size for a thread is 96KB. It is also the default stack size. This number can be retrieved at compilation time using the PTHREAD_STACK_MIN symbolic constant defined in the pthread.h header file.

Note that the maximum stack size is 256MB, the size of a segment. This limit is indicated by the PTHREAD_STACK_MAX symbolic constant in the pthread.h header file.

So one could increase the stack size to a maximum of 256MB which is quite a lot.

A2: The other way is to simply avoid potentially unbound recursion. My datasets are incredibly large. Probably not large enough to spend 256MB of stack, but it was reasonably straightforward to rewrite the quicksort function iteratively.

void quickSort_iter(double *distribution, size_t left, size_t right)
{
        if (left >= right)
                return;

        std::stack<std::pair<size_t, size_t> > partition_stack;
        partition_stack.push(std::pair<size_t, size_t>(left, right));

        while (!partition_stack.empty())
        {

                left = partition_stack.top().first;
                right = partition_stack.top().second;
                partition_stack.pop();

                size_t pivot = partition(distribution, left, right);

                if (pivot > 1)
                        partition_stack.push(std::pair<size_t, size_t>(left, pivot - 1));

                if (pivot + 1 < right)
                        partition_stack.push(std::pair<size_t, size_t>(pivot + 1, right));
        }
}

std::stack is being created using the default std::allocator, so internally it is using heap allocations to store the stack of sorting partitions and therefore will not run afoul of the stack limit.

user1848244
  • 425
  • 3
  • 12