3

I'm learning sorting algorithms and as next step, I'm trying to get my implementation perform close to the std::sort(). I'm pretty far, so far.. :-)

I have 3 implementations of quicksort:

  • standard quicksort (using temp arrays).
  • quicksort with following optimizations:
    • median3 used to select median
    • tail-recursion
    • quicksort applied only upto partition sizes < 16. For smaller partitions insertion sort is used.
    • insertion sort applied to the whole array at once instead of applying to each partition, left unsorted by quicksort.
  • quicksort with all optimizations listed above + in-place partitioning (no temp arrays).

I expected the performance to be best bottom up, but it's best top down!

What's wrong with my implementation? Given the drastic difference between the performance I assume there is something outrightly wrong.

Some numbers to give you a feel of how bad things are (N = number of elements in array, figures are time taken by each algo in microseconds): Sorting vector<int> and each algo is given exactly the same sequence of numbers.

N           quick       mixed       mixed_inplace
8           0           0           0
16          0           1           1
32          1           2           2
64          1           3           3
128         1           8           8
256         3           16          17
512         6           34          41
1,024       16          84          87
2,048       28.3        177.1       233.2
4,096       48.5        366.6       410.1
8,192       146.5       833.5       1,012.6
16,384      408.4       1,855.6     1,964.2
32,768      1,343.5     3,895.0     4,241.7
65,536      2,661.1     7,927.5     8,757.8

Using Visual Studio Express 2010.

CODE:

// ------------ QUICK SORT ------------------
template<typename T, typename key_compare>
void quicksort( T first, T last, const size_t pivot_index, key_compare comp ) {
    T saved_first = first;
    size_t N = last - first;
    if (N > 1) {
        // create a temp new array, which contains all items less than pivot
        // on left and greater on right. With pivot value in between.
        // vector<typename decltype(*T)> temp(N);
        typename iterator_traits<T>::pointer temp = (typename iterator_traits<T>::pointer) malloc(sizeof(T)*N);
        size_t left_index = 0, right_index = N - 1 ;
        iterator_traits<T>::value_type pivot_val = *(first + pivot_index);
        for(; first < saved_first + pivot_index; first++) {
            if( !comp(*first, pivot_val) )
                temp[right_index--] = *first;
            else
                temp[left_index++] = *first;
        }
        // skip the pivot value
        // TODO: swap the pivot to end so we can have a single loop instead.
        ++first;
        // do the rest
        for(; first < last; first++) {
            if( !comp(*first, pivot_val) )
                temp[right_index--] = *first;
            else
                temp[left_index++] = *first;
        }
        if( right_index == left_index )
            temp[left_index] = pivot_val;
        else
            temp[left_index+1] = pivot_val;
        // recurse for left and right..
        quicksort(temp, temp+left_index, left_index/2, comp);
        quicksort(temp+left_index+1, temp+N, (N-right_index)/2, comp);

        // return a concat'd array..
        for(size_t i = 0; i < N; i++)
            *saved_first++ = temp[i];

        free(temp);
    }
}
/*
** best, average, worst: n log n, n log n, n^2
** space: log n
*/
template<typename T, typename key_compare >
void quicksort( T first, T last, key_compare comp ) {
    size_t pivot_index = (last - first) / 2;
    quicksort( first, last, pivot_index, comp);
}

// ------------ QUICK with optimizations ------------------
/*
quicksort partition on range [first, last[ using predicate function as the comparator.
"mid" is in-out param, function uses mid as mid, and updates it before returning it with
current/new mid position after partitioning.
*/
template<typename T, typename key_compare >
void _partial_quicksort_partition( T first, T last, T& mid, key_compare comp ) {
    T savedFirst = first;
    typedef typename iterator_traits<T>::value_type _val_type;
    size_t N = last - first;
    _val_type *temp = (_val_type *) malloc((N*sizeof(_val_type)));

    // move pivot to the end..
    _val_type pivot_val = *mid;
    std::swap(*mid, *(last - 1));
    size_t left_index = 0, right_index = N - 1;

    for( ; first != last - 1; first++ ) {
        if( !comp(*first, pivot_val) )
            temp[right_index--] = *first;
        else
            temp[left_index++] = *first;
    }

    assert( right_index == left_index );

    temp[left_index] = pivot_val;

    std::copy(temp, temp+N, savedFirst);
    free(temp);
    mid = savedFirst + left_index;
}

template<typename T, typename key_compare >
void _partial_quicksort( T first, T last, key_compare comp ) {
    size_t s = last - first;
    // sort only if the list is smaller than our limit.. else it's too small for
    // us to bother.. caller would take care of it using some other stupid algo..
    if( 16 > s ) {
        // only one call to insertion_sort(), after whole array is partially sorted
        // using quicksort().
        // my_insertion_sort::insertion_sort(first, last, pred);
        return ;
    }

    // select pivot.. use median 3
    T mid = my_mixed_inplace_quicksort::median3(first, last - 1, s, comp);
    // partition
    _partial_quicksort_partition(first, last, mid, comp);
    // recurse..
    _partial_quicksort(first, mid, comp);
    // tail recurse..
    // TODO: tail recurse on longer partition..
    _partial_quicksort(mid+1, last, comp);
}

template<typename T, typename key_compare >
void mixed_quicksort( T first, T last, key_compare pred ) {
    _partial_quicksort(first, last, pred );
    my_insertion_sort::insertion_sort(first, last, pred);
}

// ------------ "in place" QUICK with optimizations ------------------
/*
in place quicksort partition on range [first, last[ using predicate function as the comparator.
"mid" is in-out param, function uses mid as mid, and updates it before returning it with
current/new mid position after partitioning.
*/
template<typename T, typename key_compare >
void _partial_inplace_quicksort_partition( T first, T last, T& mid, key_compare comp ) {
    typename iterator_traits<T>::value_type midVal = *mid;
    // move pivot to end..
    std::swap(*mid, *(last - 1));
    mid = first;
    // in-place quick sort:
    for( ; first < last - 1; first++ ) {
        if( comp(*first, midVal) ) {
            std::swap(*first, *mid);
            mid++;
        }
    }
    // bring pivot to the mid..
    std::swap(*mid, *(last - 1));
}

// brings best median to middle and returns it..
// works on array as [first, last] NOT [first, last[
template<typename T, typename key_compare >
T median3(T first, T last, size_t size, key_compare comp ) {
    T mid = first + size/2;
    if (comp(*mid, *first)) {
        std::swap(*mid, *first);
    }
    if (comp(*last, *mid)) {
        std::swap(*last, *mid);
    }
    if (comp(*mid, *first)) {
        std::swap(*mid, *first);
    }
    return mid;
}

template<typename T, typename key_compare >
void _partial_inplace_quicksort( T first, T last, key_compare comp ) {
    size_t s = last - first;
    // sort only if the list is smaller than our limit.. else it's too small for
    // us to bother.. caller would take care of it using some other stupid algo..
    if( 16 > s ) {
        // only one call to insertion_sort(), after whole array is partially sorted
        // using quicksort().
        // my_insertion_sort::insertion_sort(first, last, pred);
        return ;
    }

    // select pivot.. use median 3
    T mid = median3(first, last - 1, s, comp);
    // partition
    _partial_inplace_quicksort_partition(first, last, mid, comp);
    // recurse..
    _partial_inplace_quicksort(first, mid, comp);
    // tail recurse..
    _partial_inplace_quicksort(mid+1, last, comp);
    // print_array(first, last, "_partial_inplace_quicksort(exit2)" );
}

// in-place quick sort
// tail recurse
// median
// final insertion sort..
template<typename T, typename key_compare >
void mixedsort_inplace( T first, T last, key_compare pred ) {
    _partial_inplace_quicksort(first, last, pred );
    my_insertion_sort::insertion_sort(first, last, pred);
}

// ---------------- INSERTION SORT used above ----------------
namespace my_insertion_sort {
    template<typename T, typename key_compare>
    void insertion_sort( T first, T last, key_compare comp ) {
        // for each element in the array [first+1, last[
        for( T j = first+1; j < last; j++) {
            iterator_traits<T>::value_type curr = *j;
            T hole = j;
            // keep moving all the elements comp(hole.e. > or <) hole to right
            while( hole > first && comp(curr, *(hole-1)) ) {
                *hole = *(hole-1);
                --hole;
            }
            // insert curr at the correct position.
            *hole = curr;
        }
    }
}

Code used for testing:

#include <ctime>
#ifdef _WIN32
#include <Windows.h>
#include <WinBase.h>
#endif // _WIN32

template<typename T, typename key_compare = std::less<T>> class MySortAlgoTester;

template <typename T>
void print_array( T begin, T end, string prefix = "" ) {
    cout << prefix.c_str();
    for_each(begin, end, []( typename std::iterator_traits<T>::reference it) { cout << it << ','; } );
    cout << endl;
}

int main () {
    srand(123456789L);
    size_t numElements = 4;
    vector<char*> algoNames;
    map<double, vector<double>> results;
    int numTests = 0;
    while( (numElements *= 2) <= 4096*16 ) {
        MySortAlgoTester<int> tester(numElements);
        results[numElements] = vector<double>();
        algoNames.push_back("mixedsort_inplace");
        results[numElements].push_back(tester.test(my_mixed_inplace_quicksort::mixedsort_inplace, "mixedsort_inplace"));
        tester.reset();
        algoNames.push_back("quick");
        results[numElements].push_back(tester.test(my_quicksort::quicksort, "quicksort"));
        tester.reset();
        algoNames.push_back("mixed_quicksort");
        results[numElements].push_back(tester.test(my_mixed_quicksort::mixed_quicksort, "mixed_quicksort"));
    }
    // --- print the results...
    cout << std::setprecision(2) << std::fixed << endl << "N";
    for_each(algoNames.begin(), algoNames.begin()+(algoNames.size()/numTests), [](char* s){cout << ',' << s ;} );

    typedef std::pair<double,vector<double>> result_iter;
    BOOST_FOREACH(result_iter it, results) {
        cout << endl << it.first << ',';
    BOOST_FOREACH( double d, it.second ) {
        cout << d << ',' ;
    }
}

template<typename T, typename key_compare = std::less<T>>
class MySortAlgoTester {
    key_compare comp;
    vector<T> vec;
    typedef typename vector<T>::iterator vecIter;
    vector<T> vec_copy;
    size_t m_numElements;
    bool is_sorted(vecIter first, vecIter last) {
        vecIter sFirst = first;
        for(vecIter next = first+1; next != last;)
            // '>' associativity: left to right
            // ++ has precedence over >
            if( !comp(*(first++), *(next++)) ) {
                if(*(next-1) == *(first-1))
                    continue;
                print_array(sFirst, last, "is_sorted() returning false: ");
                cout << "comp(" << *(first-1) << ", " << *(next-1) << ") = false && "
                    << *(next-1) << " != " << *(first-1) << endl ;
                return false;
            }

            return true;
    }

public:
    MySortAlgoTester(size_t numElements) : m_numElements(numElements) {
        srand(123456789L);
        vec.resize(m_numElements);
        vec_copy.resize(m_numElements);
        //      std::generate(vec.begin(), vec.end(), rand);
        for(size_t i = 0; i < vec.size(); i++) {
            vec[i] = rand() % (m_numElements * 2);
            vec_copy[i] = vec[i];
        }
    }
    ~MySortAlgoTester() {
    }

    void reset() {
        // copy the data back so next algo can be tested with same array.
        std::copy(vec_copy.begin(), vec_copy.end(), vec.begin());
        for(size_t i = 0; i < vec_copy.size(); i++) {
            vec[i] = vec_copy[i];
        }
        // std::copy(vec_copy.begin(), vec_copy.end(),  vec);
    }

    double m___start_time_asdfsa = 0;
    double getTimeInMicroSecs() {
    #ifdef _WIN32
        LARGE_INTEGER li;
        if(!QueryPerformanceFrequency(&li))
            cout << "getTimeInMicroSecs(): QueryPerformanceFrequency() failed!" << endl;

        QueryPerformanceCounter(&li);
        return double(li.QuadPart)/1000.0;

    #else // _WIN32
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_usec + 10e6 * tv.tv_sec;
    }
    #endif // _WIN32

    inline void printClock( const char* msg ) {
        cout << msg << (long)(getTimeInMicroSecs() - m___start_time_asdfsa) << " micro seconds" << endl;
    }
    inline double getClock() {
        return (getTimeInMicroSecs() - m___start_time_asdfsa);
    }
    inline void startClock() {
        m___start_time_asdfsa = getTimeInMicroSecs();
    }

    double test( void (*sort_func)(typename vector<T>::iterator first, typename vector<T>::iterator last, typename key_compare pred), const char* name ) {
        cout << "START Testing: " << name << ". With --- " << m_numElements << " elements." << endl;
        startClock();

        sort_func(vec.begin(), vec.end(), comp);
        double ms = getClock();
        if(!MySortAlgoTester::is_sorted(vec.begin(), vec.end())) {
            cout << name << " did not sort the array." << endl;
            // throw string(name) + " did not sort the array.";
        }
        cout << "DONE Testing: " << name << ". Time taken (ms): " << ms << endl;
        return ms;
    }

    double test( void (*sort_func)(typename vector<T>::iterator first, typename vector<T>::iterator last), const char* name ) {
        cout << "START Testing: " << name << ". With --- " << m_numElements << " elements." << endl;
        startClock();

        sort_func(vec.begin(), vec.end());
        double ms = getClock();
        if(!MySortAlgoTester::is_sorted(vec.begin(), vec.end())) {
            cout << name << " did not sort the array." << endl;
            // throw string(name) + " did not sort the array.";
        }
        cout << "DONE Testing: " << name << ". Time taken (ms): " << ms << endl;
        return ms;
    }
};
Kashyap
  • 15,354
  • 13
  • 64
  • 103
  • ok, awaiting the actual benchmark testrunner – sehe Aug 31 '12 at 18:17
  • I don't know if this will be any help, but watch: https://www.youtube.com/watch?v=aMnn0Jq0J-E the video is named `Three Beautiful Quicksorts`. You'll love it (it's a technical talk about just that)! – Paul Aug 31 '12 at 18:18
  • *sorting algos* -> Instant "I'm not going to even bother reading your question." Seriously, is *algorithms*, five extra characters, too hard to type? – Drise Aug 31 '12 at 18:21
  • That `8,757.8` run you let run for _two and a half hours_? – Mooing Duck Aug 31 '12 at 18:24
  • also: look up `std::parition`, that's half the quicksort right there. – Mooing Duck Aug 31 '12 at 18:25
  • @Drise - fixed hope now you'll read and make some other equally useful contribution to the discussion.. – Kashyap Aug 31 '12 at 18:27
  • @MooingDuck it's microseconds.. so it was like 8 seconds.. I'm not THAT patient.. :), there is `std::merge` too.. but that kills the whole purpose of learning. – Kashyap Aug 31 '12 at 18:27
  • @thekashyap My point is, you took how much time formatting this as a nice post (which it is), and you didn't bother to type 10 characters? Anyhow. 2.5 seconds over 60k items doesn't seem bad to me. – Drise Aug 31 '12 at 18:34
  • Also, you should be doing each test individually. Things may happen when you do all tests in the same program. – Drise Aug 31 '12 at 18:36
  • @thekashyap microseconds or milliseconds? You're on Windows so I guess milliseconds :-) –  Aug 31 '12 at 18:41
  • @lttlrck, I've now posted the code that fetches the time. AFAIK it's microseconds, anyway it felt like it.. :) see `getTimeInMicroSecs()` above. Uses `QueryPerformanceFrequency()` and `QueryPerformanceCounter()` – Kashyap Sep 04 '12 at 17:52
  • Why is this an optimization? "insertion sort applied to the whole array at once instead of applying to each partition, left unsorted by quicksort." it sounds like it might decrease your locality of reference – Tim Lovell-Smith Oct 17 '12 at 04:56
  • 1
    @TimLovell-Smith sorry, I don't understand what you mean by "it sounds like it might decrease your locality of reference" ? In general it does improve the performance for larger arrays. – Kashyap Oct 24 '12 at 20:50
  • @thekashyap - thanks I see I wasn't thinking clearly - it would increase locality of reference, and ability for cache prediction, not decrease. – Tim Lovell-Smith Oct 25 '12 at 22:26

2 Answers2

5

You have a bug in the algorithm itself. E.g. there is undefined behaviour here:

The insertion sort potentially reads out-of-bounds in the marked following line:

*(hole--) = *(hole-1);

It reads before the first element. I suggest you meant

*hole = *(hole-1);
--hole;

Update quickly benchmarked with GNU g++ 4.6.1 on 64bit linux. I rearranged the timing to be at totals, so I didn't have to reimplement the clock functions (I'm lazy).

Adapted code: http://ideone.com/LgAgs

Built with

g++ -std=c++0x -g -O3 test.cpp -o test

Here's the result: The insertion sort appears to be roughly ~60-100x slower than the others.

START Testing: insertion_sort. With --- 8 elements.
START Testing: insertion_sort. With --- 16 elements.
START Testing: insertion_sort. With --- 32 elements.
START Testing: insertion_sort. With --- 64 elements.
START Testing: insertion_sort. With --- 128 elements.
START Testing: insertion_sort. With --- 256 elements.
START Testing: insertion_sort. With --- 512 elements.
START Testing: insertion_sort. With --- 1024 elements.
START Testing: insertion_sort. With --- 2048 elements.
START Testing: insertion_sort. With --- 4096 elements.
START Testing: insertion_sort. With --- 8192 elements.
START Testing: insertion_sort. With --- 16384 elements.
START Testing: insertion_sort. With --- 32768 elements.
START Testing: insertion_sort. With --- 65536 elements.

real    0m1.532s
user    0m1.524s
sys 0m0.004s
START Testing: quicksort. With --- 8 elements.
START Testing: quicksort. With --- 16 elements.
START Testing: quicksort. With --- 32 elements.
START Testing: quicksort. With --- 64 elements.
START Testing: quicksort. With --- 128 elements.
START Testing: quicksort. With --- 256 elements.
START Testing: quicksort. With --- 512 elements.
START Testing: quicksort. With --- 1024 elements.
START Testing: quicksort. With --- 2048 elements.
START Testing: quicksort. With --- 4096 elements.
START Testing: quicksort. With --- 8192 elements.
START Testing: quicksort. With --- 16384 elements.
START Testing: quicksort. With --- 32768 elements.
START Testing: quicksort. With --- 65536 elements.

real    0m0.025s
user    0m0.016s
sys 0m0.008s
START Testing: mixed_quicksort. With --- 8 elements.
START Testing: mixed_quicksort. With --- 16 elements.
START Testing: mixed_quicksort. With --- 32 elements.
START Testing: mixed_quicksort. With --- 64 elements.
START Testing: mixed_quicksort. With --- 128 elements.
START Testing: mixed_quicksort. With --- 256 elements.
START Testing: mixed_quicksort. With --- 512 elements.
START Testing: mixed_quicksort. With --- 1024 elements.
START Testing: mixed_quicksort. With --- 2048 elements.
START Testing: mixed_quicksort. With --- 4096 elements.
START Testing: mixed_quicksort. With --- 8192 elements.
START Testing: mixed_quicksort. With --- 16384 elements.
START Testing: mixed_quicksort. With --- 32768 elements.
START Testing: mixed_quicksort. With --- 65536 elements.

real    0m0.016s
user    0m0.004s
sys 0m0.008s
sehe
  • 374,641
  • 47
  • 450
  • 633
  • sorting: vector, yes, data is "exactly" the same. I'll update the post with tester code. – Kashyap Aug 31 '12 at 18:15
  • have added simplest timings on my system. Hope that helps – sehe Aug 31 '12 at 19:05
  • associativity of `operator =` is right to left, so `operator --` will be called "after" `*(hole - 1)` is executed. And `*(hole--) = *(hole-1);` is executed only if `hole > first`. So it should work. Did I miss something? – Kashyap Aug 31 '12 at 19:12
  • @thekashyap: `operator = is right to left`, so it will do the _assignments_ from left to right. But you only have one assignment. It evaluates each side of the assignment in a unspecified order. So the `hole--` can happen before `hole-1` – Mooing Duck Aug 31 '12 at 19:16
  • 1
    In fact, on GCC it _did_ happen before, because valgrind was happy to inform me about reading before the start of the allocated memory region – sehe Aug 31 '12 at 19:25
  • @JerryCoffin, thanks ! sehe, indeed, after you posted your results I also tried in CodeBlocks with GCC and got errors. Works with VS2010. So undefined it is.. :-) – Kashyap Aug 31 '12 at 20:50
0

So finally I think I figured out at least a part of what's wrong.

Thanks to sehe for the hint.

  • When compiled with -O3 (on GCC, or /Ox on MSVC) mixed_inplace is the fastest and pretty close to std::sort()
    • I suppose this means that at least some of the expected optimizations (tail-recursion) weren't applied by compiler when compiling at lower optimization level.
  • Build should be release build (w/o -g on GCC).
  • @sehe: insertion sort performance is irrelevant.
  • std::sort() implementation on GCC and MSVC are different, so it's not really right to compare the two.

Here are the results on windows and linux with and w/o optimization options:

Windows with MSVC: Windows with MSVC

Windows with GCC: enter image description here

RedHat Linux with GCC: RHEL with GCC

Kashyap
  • 15,354
  • 13
  • 64
  • 103
  • Although the implementation may differ, I think it is perfectly valid to compare `std::sort` on MSVC/GCC (it's like it's perfectly fine to compare performance across different CPUs to compare implementation of the CPUs _for specific tasks_) – sehe Sep 12 '12 at 19:31
  • @sehe philosophically: yes. :-).. But we're talking abt the actual implementation and optimizations inside.. ps: my `mixed_inplace_quicksort` was based on the MSVC's implementation.. have added Windows with GCC results now.. – Kashyap Sep 12 '12 at 21:38