
There are many sorting algorithms available. A sorting algorithm with O(n^2) time complexity may be preferable to an O(n log n) one because it is in-place or stable. For example:

  • Insertion sort works well on nearly sorted input.
  • Applying quicksort to a nearly sorted array is unwise.
  • Heap sort runs in O(n log n) but is not stable.
  • Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.

I want to know which sorting algorithm is suitable in what conditions.

  • Which sorting algorithm is best for sorting names in alphabetical order?
  • Which sorting algorithm is best for sorting a small number of integers?
  • Which sorting algorithm is best for sorting a small number of integers that may span a large range (98767 – 6734784)?
  • Which sorting algorithm is best for sorting billions of integers?
  • Which sorting algorithm is best for sorting in embedded or real-time systems where both space and time are constrained?

Please suggest, for these or other situations, books or websites with this type of comparison.

John Kugelman
unknown_boundaries
    There's no simple answer to that, and in most cases caring about which is "best" is excessive (premature optimisation) anyway - you just use the sort from the standard library for your language. However, when sorting integers, radix sort is worth knowing. –  Dec 15 '12 at 06:45

2 Answers


Well, there is no silver bullet - but here are some rules of thumb:

  1. Radix sort / counting sort is usually good when the range of the elements (call it U) is small relative to the number of elements (U << n). (Might fit your cases 2 and 4.)
  2. Insertion sort is good for small lists (say n < 30), where it is empirically faster even than O(n log n) algorithms. In fact, you can optimize an O(n log n) top-down algorithm by switching to insertion sort when n < 30.
  3. A variation of radix sort can also be a good choice for sorting strings alphabetically, since it is O(|S|*n), while a normal comparison-based algorithm is O(|S|*n log n), where |S| is the length of the strings. (Fits your case 1.)
  4. When the input is very large - way too large to fit in memory - the way to do it is with an external sort: a variation of merge sort that minimizes the number of disk reads/writes and makes sure they are done sequentially, which improves performance drastically. (Might fit case 4.)
  5. For general-purpose sorting, quicksort and Timsort (used by Java and Python) give good performance.
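A minimal sketch of rule 2, assuming a cutoff of 30 (the `kCutoff` constant and function names are made up for illustration; the best cutoff should be tuned empirically for your platform):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cutoff below which insertion sort beats the recursive path.
constexpr std::size_t kCutoff = 30;

// Sorts a[lo, hi) by shifting each element left into place.
void insertion_sort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > lo && a[j - 1] > key) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

// Top-down merge sort that falls back to insertion sort on small runs.
void hybrid_sort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= kCutoff) {  // small run: insertion sort wins
        insertion_sort(a, lo, hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    hybrid_sort(a, lo, mid);
    hybrid_sort(a, mid, hi);
    std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}
```

The same trick applies to quicksort: recurse only while partitions are larger than the cutoff.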
amit
  • +1 - my only complaint is that AFAIK "external sort" doesn't mean any particular sort algorithm. Any sort that works with external storage is an external sort. Merge sort is definitely a good algorithm for external sorting, but not necessarily the only one. In particular, the cache-oblivious funnel sort should work well for data on a hard drive, if only you can find someone who understands it well enough to implement it. –  Dec 17 '12 at 00:56

Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.

You may be interested in the std::stable_sort function from the C++ standard library. It tries to allocate the extra space for a regular merge sort, but if that fails it does an in-place stable merge sort with inferior time complexity (O(n (log n)^2) instead of O(n log n)). If you can read C++, you can look at the implementation in your favourite standard library; otherwise I expect you can find the details explained somewhere in language-agnostic terms.

There's a body of academic literature about in-place stable sorting (and in particular in-place merging).

So in C++ the rule of thumb is easy, "use std::stable_sort if you need a stable sort, otherwise use std::sort". Python makes it even easier again, the rule of thumb is "use sorted".
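As a sketch of when the stable variant matters (the `Record` type and `sort_by_score` helper are hypothetical, made up for this example): stability guarantees that elements comparing equal keep their input order, which std::sort does not promise.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record type: sorted by score, ties broken by input order.
struct Record {
    std::string name;
    int score;
};

void sort_by_score(std::vector<Record>& v) {
    // std::stable_sort keeps equal-score records in their original
    // relative order; std::sort makes no such guarantee.
    std::stable_sort(v.begin(), v.end(),
                     [](const Record& a, const Record& b) {
                         return a.score < b.score;
                     });
}
```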

In general, you will find that a lot of languages have fairly clever built-in sort algorithms, and you can use them most of the time. It's rare that you'll need to implement your own to beat the standard library. If you do need to implement your own, there isn't really any substitute for pulling out the textbooks, implementing a few algorithms with as many tricks as you can find, and testing them against each other for the specific case you're worried about for which you need to beat the library function.

Most of the "obvious" advice that you might be hoping for in response to this question is already incorporated into the built-in sort functions of one or more common programming languages. But to answer your specific questions:

Which sorting algorithm is best for sorting names in alphabetical order?

A radix sort might edge out standard comparison sorts like C++'s std::sort, but that might not be possible if you're using "proper" collation rules for names. For example, "McAlister" used to be alphabetized the same as "MacAlister", and "St. John" as "Saint John". But then programmers came along and wanted to just sort by ASCII value rather than code a lot of special rules, so most computer systems don't use those rules any more. I find Friday afternoon is a good time for this kind of feature ;-) You can still use a radix sort if you do it on the letters of the "canonicalized" name rather than the actual name.

"Proper" collation rules in languages other than English are also entertaining. For example in German "Grüber" sorts like "Grueber", and therefore comes after "Gruber" but before "Gruhn". In English the name "Llewellyn" comes after "Lewis", but I believe in Welsh (using the exact same alphabet but different traditional collation rules) it comes before.

For that reason, it's easier to talk about optimizing string sorts than it is to actually do it. Sorting strings "properly" requires being able to plug in locale-specific collation rules, and if you move away from a comparison sort then you might have to re-write all your collation code.

Which sorting algorithm is best for sorting a small number of integers?

For a small number of small values, maybe a counting sort; but Introsort with a switch to insertion sort when the data gets small enough (20-30 elements) is pretty good. Timsort is especially good when the data isn't random.
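A minimal counting sort sketch for this case, assuming non-negative values with a known small maximum (the `counting_sort` name and signature are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counting sort: O(n + U) time and O(U) space, where U is the value
// range. Only sensible when U is comparable to n.
std::vector<int> counting_sort(const std::vector<int>& a, int max_value) {
    std::vector<std::size_t> count(static_cast<std::size_t>(max_value) + 1, 0);
    for (int x : a) ++count[x];  // tally each value

    std::vector<int> out;
    out.reserve(a.size());
    for (int v = 0; v <= max_value; ++v)
        out.insert(out.end(), count[v], v);  // emit each value count[v] times
    return out;
}
```

Note it is not a comparison sort, which is exactly why the wide range in the next question rules it out: the count array grows with U, not n.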

Which sorting algorithm is best for sorting a small number of integers that may span a large range (98767 – 6734784)?

The large range rules out counting sort, so for a small number of widely-ranged integers, Introsort/Timsort.

Which sorting algorithm is best for sorting billions of integers?

If by "billions" you mean "too many to fit in memory", that changes the game a bit. Probably you want to divide the data into chunks that do fit in memory, Introsort/Timsort each one, then do an external merge. Or, if you're on a 64-bit machine sorting 32-bit integers, you could consider counting sort.
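A sketch of the merge phase of such an external sort, with in-memory vectors standing in for sorted runs that would live on disk (the `merge_runs` helper is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted runs using a min-heap. Each run is consumed
// strictly front-to-back, which is the property that matters on disk:
// every file is read sequentially, once.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, std::size_t>;  // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, r] = heap.top();  // globally smallest head element
        heap.pop();
        out.push_back(value);
        if (++pos[r] < runs[r].size())
            heap.push({runs[r][pos[r]], r});  // refill from the same run
    }
    return out;
}
```

In a real external sort the runs would be files and `out` would be streamed to disk rather than accumulated in memory.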

Which sorting algorithm is best for sorting in embedded or real-time systems where both space and time are constrained?

Probably Introsort.

Insertion sort works well on nearly sorted input.

True, and Timsort takes advantage of the same situation.

Applying quicksort to a nearly sorted array is unwise.

False. Nobody uses the plain QuickSort originally published by Hoare; you can make better choices of pivot that make the killer cases much less obvious than "sorted data". To deal with the bad cases thoroughly, there is Introsort.

Heap sort runs in O(n log n) but is not stable.

True, but Introsort is better (and also not stable).

Merge sort can't be used in embedded systems because in the worst case it requires O(n) extra space.

Handle this by allowing for somewhat slower in-place merging like std::stable_sort does.

Steve Jessop
  • Give Hoare some credit. His paper, legally available from oxfordjournals.org, includes picking the pivot randomly as well as median of e.g. three. – user515430 Dec 15 '12 at 15:13
  • Merge sort can be done in-place. Knuth outlines a solution due to Kronrod in section 5.2.4 of The Art of Computer Programming. There is a description of another algorithm in Huang and Langston's paper in CACM Vol 31 (1988), pp 348-352. – user515430 Dec 15 '12 at 23:46
  • @user515430: right, "plain QuickSort" is in effect a straw man, an algorithm nobody uses but which introduces the idea of sort-by-partition using a trivial pivot selection. But for some reason people continue to think that QuickSort has a problem with already-sorted input. I don't blame Hoare. As you say, his paper makes clear that the choice of pivot is important but not obvious, and he doesn't stipulate any particular choice there. I don't know when the idea of actually *choosing* the first element as pivot was introduced: maybe in his sample code elsewhere or maybe by someone else. – Steve Jessop Dec 16 '12 at 16:50
  • And for a little more info on in-place merging see my old answer (http://stackoverflow.com/questions/2571049/how-to-sort-in-place-using-the-merge-sort-algorithm). AFAIK the techniques that hit the theoretical complexity bounds are slow in practice, so they're not good all-purpose choices. As I said originally, in-place merge is possible but slower. – Steve Jessop Dec 16 '12 at 16:54
  • Even when merge sort is done simply, and not in-place, it needs O(n) extra space, but doesn't necessarily need O(n) RAM at all - not even for the original data. That's one reason it was popular in ancient times - machines with virtually no RAM but with several tape drives could run a merge sort by running the tapes through however many times was needed. In principle you could use the same trick today to sort terabyte-scale datasets with a minimum of hard-disk head thrashing, so long as you have at least two physical hard drives. –  Dec 17 '12 at 00:16