79

According to Scott Meyers, in his Effective STL book - item 46. He claimed that std::sort is about 670% faster than std::qsort due to the fact of inline. I tested myself, and I saw that qsort is faster :( ! Could anyone help me to explain this strange behavior?

#include <iostream>
#include <vector>
#include <algorithm>

#include <cstdlib>
#include <ctime>
#include <cstdio>

const size_t LARGE_SIZE = 100000;

struct rnd {
    int operator()() {
        return rand() % LARGE_SIZE;
    }
};

int comp( const void* a, const void* b ) {
    return ( *( int* )a - *( int* )b );
}

int main() {
    int ary[LARGE_SIZE];
    int ary_copy[LARGE_SIZE];
    // generate random data
    std::generate( ary, ary + LARGE_SIZE, rnd() );
    std::copy( ary, ary + LARGE_SIZE, ary_copy );
    // get time
    std::time_t start = std::clock();
    // perform quick sort C using function pointer
    std::qsort( ary, LARGE_SIZE, sizeof( int ), comp );
    std::cout << "C quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
    // get time again
    start = std::clock();
    // perform quick sort C++ using function object
    std::sort( ary_copy, ary_copy + LARGE_SIZE );
    std::cout << "C++ quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
}

This is my result:

C quick-sort time elapsed: 0.061
C++ quick-sort time elapsed: 0.086
Press any key to continue . . .

Update

Effective STL 3rd Edition ( 2001 )
Chapter 7 Programming with STL
Item 46: Consider function objects instead of functions as algorithm parameters.

Anton Menshov
  • 2,266
  • 14
  • 34
  • 55
roxrook
  • 13,511
  • 40
  • 107
  • 156
  • 17
    Did you let your compiler optimize? Debug/unoptimized builds won't take full advantage of things like inlining. – Edward Strange Jan 16 '11 at 21:23
  • 5
    Understanding how quick sort works, would give you a better idea of how to test it, in short: 1. use a larger array, eg: 10^6 in size, then populate the array in descending order 999999... 4,3,2,1 - this will cause the sort to become O(n^2), doing this will effectively demonstrate why inlining comparators makes such a big difference in this particular algorithm. –  Jan 16 '11 at 21:24
  • 2
    @Zenikoder: No, that will not cause the sort to become quadradic. – Billy ONeal Jan 16 '11 at 21:26
  • 11
    @Zenikoder- Almost no implementations of `qsort` or `sort` will use a quicksort implementation that breaks on reverse-sorted inputs. The most common STL `sort` implementation uses introsort, which introspects on the quicksort routine to ensure it never degrades to worse than O(n lg n), and I'm fairly confident that the C `qsort` routine uses something similar (or at least a heuristic like median-of-three) to prevent this. – templatetypedef Jan 16 '11 at 21:26
  • Can you please provide the name of the EffC++ item? I have an older edition (3rd) and the numbers have changed. In my edition, item 46 doesn’t touch `std::sort` vs. `qsort` at all. – Konrad Rudolph Jan 16 '11 at 21:29
  • 3
    @Noah: According to a 06 article on artima SM: "I’ll begin with what many of you will find an unredeemably damning confession: I have not written production software in over 20 years, and I have never written production software in C++." He calls himself an archeologist/anthropologist of the C++ language. –  Jan 16 '11 at 21:30
  • 2
    Damn, I accidentally upvoted @Zendikoder’s comment. Nothing against you, but your comment is wrong. May I suggest to you the excellent Bentley&McIlroy paper “Engineering a sort function” which explains in detail how most `qsort` functions work, and Musser’s paper “Introspective sorting and selection algorithms”, which explains how `std::sort` works. – Konrad Rudolph Jan 16 '11 at 21:32
  • @Billy: you're right, std::sort wont as its guaranteed to be worst case O(nlogn), but qsort definitely will. –  Jan 16 '11 at 21:32
  • @Konrad Rudolph: Hello, see my update. Thanks for your concern. – roxrook Jan 16 '11 at 21:36
  • 1
    @Konrad: have you had a read of this paper: http://www.cs.dartmouth.edu/~doug/mdmspe.pdf –  Jan 16 '11 at 21:43
  • @Zenikoder: I highly doubt `qsort` will be that way either. Nobody in their right mind uses the "choose the last element as pivot" strategy as it falls apart of the source range is already sorted (which is relatively common). I would be extremely surprised if the pivot is not chosen at random in most every C implementation. – Billy ONeal Jan 16 '11 at 21:44
  • @Konrad Rudolph: Could you give me the link to those articles that you mentioned above about qsort and std::sort. Thanks in advance. – roxrook Jan 16 '11 at 21:45
  • @Billy: Don't most implementations use the pivot of three selection criteria? I think i'm going to open a question specifically regarding qsort, its getting very interesting. –  Jan 16 '11 at 21:48
  • @Zenikoder: Pivot of 3 would not result in quadradic performance with a reverse sorted array. – Billy ONeal Jan 16 '11 at 21:48
  • 3
    @Chan: The Bentley and McIlroy paper can be found here: http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol23/issue11/spe862jb.pdf –  Jan 16 '11 at 21:54
  • 3
    @Chan: The Musser paper can be found here: http://sist.sysu.edu.cn/~qiaohy/DS2010/introsort.pdf –  Jan 16 '11 at 21:57
  • @Zenikoder: Thanks a lot for those links – roxrook Jan 16 '11 at 22:04
  • 1
    @Zenikoder: I haven’t read the paper but I know it, and the strategy described therein. But that doesn’t make your first comment right, which described a fundamentally different strategy. Apart from that, introsort (and thus `std::sort`) is robust even against this attack since it strictly bounds the number of recursions. – Konrad Rudolph Jan 16 '11 at 22:06
  • Did you let compiler optimize code? if you use gcc or clang, use -O3, in Visual c++ use /Ox – ivan.ukr Oct 05 '17 at 20:28

8 Answers8

105

std::clock() is not a viable timing clock. You should use a platform-specific higher resolution timer, like the Windows High Performance Timer. More than that, the way that you call clock() is that first, text is output to the console, which is included in the time. This definitely invalidates the test. In addition, make sure that you compiled with all optimizations.

Finally, I copied and pasted your code, and got 0.016 for qsort and 0.008 for std::sort.

user438383
  • 5,716
  • 8
  • 28
  • 43
Puppy
  • 144,682
  • 38
  • 256
  • 465
  • 2
    @DeadMG: Thanks! I changed to release mode, and I got the similar result. I really love Scott Meyers, and I trust his word ;) – roxrook Jan 16 '11 at 21:27
  • Text seems to be is output on both cases so it can not exactly make the outcome invalid. – Öö Tiib Jan 16 '11 at 21:29
  • 2
    @Oo Tiib: Text being output doesn't mean that it doesn't output in the same time. What if the buffer is somewhat bigger than the first but less than the second? Now it has to flush before the second call- but it didn't in the first call. Oh dear. I'm not very happy though because I fixed all the above problems and qsort is now a lot faster. :( – Puppy Jan 16 '11 at 21:33
  • @DeadMG: How is qsort a lot faster? Could you explain this? – roxrook Jan 16 '11 at 21:41
  • @DeadMG: Did you switch to using accurate performance counters instead of `std::clock` in your improved version? – Billy ONeal Jan 16 '11 at 21:52
  • @Billy ONeal: sorry for my limitint knowledge. I only switched to Release. Should I apply another function object using subtraction instead of less<>? I think I should ask another question about how to perform Benchmark properly. – roxrook Jan 16 '11 at 21:56
  • @Chan: Wouldn't it make more sense to make your comparison function operate in terms of `<` ? My comment to DeadMG above is about the timer though. (Oh: and no reason to be sorry) – Billy ONeal Jan 16 '11 at 22:13
  • @Billy: Of course I did. Every point I mentioned above, I fixed. Now std::sort is eight times slower than qsort. I even reversed the order so that std::sort will have the cache advantage and it's still very slow. – Puppy Jan 16 '11 at 22:47
  • @DeadMG: Couple things. First, this is my code: http://pastebin.com/sjPWeNCQ , and I got "Qsort took 28827688 ticks. std::sort took 21978215 ticks." -- a win for `std::sort`. I got your factor of 8 win for `qsort` when I forgot to add `*`s before the casts in the comparison function -- comparing the addresses instead of the integers themselves. (Which makes sense as the original array would have already been sorted). Maybe when you "fixed" the comparison you did the same thing? – Billy ONeal Jan 16 '11 at 23:51
  • You didn't set the qsort comparison to `<`, you left it as `-`. That's a false comparison to std::sort's default predicate. That change is what killed std::sort's relative performance for me. – Puppy Jan 17 '11 at 00:02
  • 1
    @DeadMG: `std::qsort` requires "The return value of this function should represent whether elem1 is considered less than, equal to, or greater than elem2 by returning, respectively, a negative value, zero or a positive value." `operator<` does not meet that requirement (specifically it returns only 0 or 1). Check to make sure that `std::sort` and `std::qsort` produce the same results in your testing :) (Just changing `-` to `<` results in `qsort` returning the wrong answer for me) – Billy ONeal Jan 17 '11 at 00:04
  • @Billy: Ah, interesting. I didn't realize that it didn't take a binary predicate like std::sort did. That would explain it. – Puppy Jan 17 '11 at 00:10
  • Isn't the simple `return *a - *b` comparison subject to incorrect results due to integer overflow? E.g. large positive *a, large negative *b. – Andrew Janke Aug 31 '11 at 14:30
  • @Andrew Janke: He used modulo to limit the size. – Puppy Aug 31 '11 at 22:24
24

I am surprised that no one mentions caches.

In your code, you start by touching ary and ary_copy so they are resident in the cache at the time of qsort. During qsort, ary_copy might get evicted. At the time of std::sort, the elements would have to be fetched from memory or a larger (read slower) cache level. This will of course depend on your cache sizes.

Try to reverse the test, i.e., start by running std::sort.

As some people have pointed out; making the array larger will make the test more fair. The reason is that a large array is less likely to fit in cache.

user438383
  • 5,716
  • 8
  • 28
  • 43
rasmus
  • 3,136
  • 17
  • 22
  • 7
    I am surprised no one mentioned any strategies to measure the actual effectiveness of the code. You can write a tiny program that sorts a few hundred elements, get everything loaded into your L1 cache, and rip through that in record time, but that is in no way going to reflect your actual program running on a system with a few hundred other active processes, doing context switches because you're compute-bound and the scheduler hates you, while sorting a dataset the size of New Jersey. Make your benchmark look much like the real application. – Wexxor Jul 12 '13 at 22:37
14

The two sorting algorithms, without optimizations enabled, should have comparable performance. The reason that the C++ sort tends to appreciably beat qsort is that the compiler can inline the comparisons being made, since the compiler has type information about what function is being used to perform the comparison. Did you run these tests with optimization enabled? If not, try turning it on and running this test again.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • Thanks! I'm using Visual Studio, and I really don't know how to turn optimization on. – roxrook Jan 16 '11 at 21:24
  • 3
    @Chan: Switch to using the "Release" build. Also make sure you don't run the program from whithin visual studio for your benchmarks -- things like debuggers will change the time characteristics of your program. – Billy ONeal Jan 16 '11 at 21:24
  • @Billy ONeal: I switched to Release, and I got the expected result. Happy ^_^ ! – roxrook Jan 16 '11 at 21:28
13

Another reason that qsort may perform much better than expected is that newer compilers can inline and optimize through the function pointer.

If the C header defines an inline implementation of qsort instead of implementing it inside of a library and the compiler supports indirect function inlining, then qsort can be just as fast as std::sort.

Zan Lynx
  • 53,022
  • 10
  • 79
  • 131
6

Writing accurate benchmarks is difficult, so let's get Nonius to do it for us! Let's test qsort, std::sort with no inlining, and std::sort with inlining (the default) on a vector of a million random integers.

// sort.cpp
#define NONIUS_RUNNER
#include <nonius.h++>
#include <random>
#include <algorithm>

// qsort
int comp(const void* a, const void* b) {
    const int arg1 = *static_cast<const int*>(a);
    const int arg2 = *static_cast<const int*>(b);

    // we can't simply return a - b, because that might under/overflow
    return (arg1 > arg2) - (arg1 < arg2);
}

// std::sort with no inlining
struct compare_noinline {
    __attribute__((noinline)) bool operator()(const int a, const int b) {
        return a < b;
    }
};

// std::sort with inlining
struct compare {
    // the compiler will automatically inline this
    bool operator()(const int a, const int b) {
        return a < b;
    }
};

std::vector<int> gen_random_vector(const size_t size) {

    std::random_device seed;
    std::default_random_engine engine{seed()};
    std::uniform_int_distribution<int> dist{std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};

    std::vector<int> vec;
    for (size_t i = 0; i < size; i += 1) {
        const int rand_int = dist(engine);
        vec.push_back(rand_int);
    }

    return vec;
}

// generate a vector of a million random integers
constexpr size_t size = 1'000'000;
static const std::vector<int> rand_vec = gen_random_vector(size);

NONIUS_BENCHMARK("qsort", [](nonius::chronometer meter) {

    // Nonius does multiple runs of the benchmark, and each one needs a new
    // copy of the original vector, otherwise we'd just be sorting the same
    // one over and over
    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::qsort(current_vec.data(), current_vec.size(), sizeof(int), comp);

        return current_vec;
    });
});

NONIUS_BENCHMARK("std::sort noinline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare_noinline{});

        return current_vec;

    });
});

NONIUS_BENCHMARK("std::sort inline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare{});

        return current_vec;

    });
});

Compiling with Apple Clang 7.3.0,

$ clang++ -std=c++14 -stdlib=libc++ -O3 -march=native sort.cpp -o sort
$ ./sort

and running it on my 1.7 GHz i5 Macbook Air, we get

qsort                211 ms +/- 6 ms
std::sort noinline   127 ms +/- 5 ms
std::sort inline      87 ms +/- 4 ms

So std::sort with no inlining is about 1.7x faster than qsort (perhaps due to different sorting algorithms), and inlining bumps that up to about 2.4x faster. Certainly an impressive speedup, but much less than 670%.

Mr Bingley
  • 354
  • 3
  • 10
  • 1
    Can this be a case when CPU caches values so consectutive access to them is much faster, so if you, for example, will run std::sort test before qsort it will show that qsort is faster? – Soup Endless Feb 06 '21 at 14:39
6

On my machine adding some meat (making the array 10 million elements and moving it in the data section) and compiling with

g++ -Wall -O2 -osortspeed sortspeed.cpp

I get as result

C quick-sort time elapsed: 3.48
C++ quick-sort time elapsed: 1.26

Be also careful about modern "green" CPUs that may be configured to run at a variable speed depending on the load of the system. When benchmarking, this kind of behavior can drive you crazy (on my machine I've a small script that fixes CPU clock that I use when making speed tests).

6502
  • 112,025
  • 15
  • 165
  • 265
  • 6
    "Green" CPUs don't matter if you're using performance counters (as you should be doing for meaningful benchmark results) – Billy ONeal Jan 16 '11 at 21:50
  • Performance counters are great, but clock is not that bad if you're not trying to measure small stuff. Also clock() is per-process, perf counters are global. – 6502 Jan 16 '11 at 21:59
  • 2
    @6502: You have that reversed. Perf counters are per process, clock is global. – Billy ONeal Jan 16 '11 at 23:49
  • @Billy ONeal: I thought you meant RDTSC and that's very nice but global. And no, clock() is a per-process counter. See http://www.cs.utah.edu/dept/old/texinfo/glibc-manual-0.02/library_19.html – 6502 Jan 17 '11 at 00:29
  • 1
    @6502: glibc != standard c. Usually I believe these things **are** implemented in terms of `rdtsc`, but the OS keeps track of what the timestamps are when it performs a context switch, and restores these values when the context is given back to the process being measured. – Billy ONeal Jan 17 '11 at 00:35
  • @Billy ONeal: Neat to know ... I always used RDTSC directly and so system load was a problem. About clock IIRC it's a POSIX requirement to have clock() working per-process but indeed I wouldn't be surprised to find that windows doesn't conform (portability is not their main concern... apparently au contraire actually: according to msdn VS6.0 clock was per-process, 8.0 is wall clock - see http://msdn.microsoft.com/en-us/library/4e2ess30(VS.80).aspx and http://msdn.microsoft.com/en-us/library/aa272059(VS.60).aspx ) – 6502 Jan 17 '11 at 01:31
3

Why there's no one mentions that extra memory fetch in the C standard library's compare function?

In the C standard library, void* is used to hold all types of member data, which means that when the member data is actually accessed, the void* pointer must be performed once additional dereference.

struct list {
        void *p_data; // deference this pointer when access real data
        struct list *prev;
        struct list *next;
};

However, in STL, with the help of template's code generation capabilities, member data saved with void* in the C standard library can be directly placed inside the type, avoiding additional dereferences during access.

template <typename T>
class list {
public:
        T data; // access data directly
        list *prev;
        list *next;
};

So, in theory, std::sort is faster than qsort.

2

I am not sure with 670% faster. It must have been a specific dataset tailored to show the speed of std::sort. In general, std::sort is indeed faster than qsort because of a couple of these things:

  1. qsort operates on void*, which first requires a dereference, and second requires the size of the data type to perform the swaps. Therefore, the swap operation of qsort is done every byte. Look at qsort implementation and notice its SWAP macro is a loop. Here's the video by Jon Bentley explaining the timing differences (starts at 45m): https://www.youtube.com/watch?v=aMnn0Jq0J-E&t=2700s

  2. The inline may make it speed up a bit, but that's micro-optimization, not the major contributor.

  3. std::sort is actually a hybrid algorithm called Introsort. C qsort is a pure quicksort implementation. Given a dataset that's terrible for quicksort, std::sort changes to heapsort instead. So if you create a bad input for qsort, it will be unbearably slow.

  4. The profiling code above is insufficient. Input size should be increased. 100K is hardly enough. Increase it to 1M or 10M, then repeat the sorting multiple times then take the average or median. If necessary, compile them into a separate binary and run them separately.

garbagecollector
  • 3,731
  • 5
  • 32
  • 44