
I'd like to know if there's a faster way to sort such an array than quicksort/mergesort.

The maximum array length is 10^6. Each word's length is >= 10 and <= 100, and a word can contain a-z and spaces (27 different characters in total). Characters are not unique within a word (they can repeat). All the words in the array are equally long.

Peter O.
Patryk
  • There is no "fastest" way, if you have no information about the possible order of the incoming data. You have to pick one of the popular algorithms based on best-case vs worst-case performance (and how likely each is) and storage/data access constraints. – Hot Licks Oct 28 '12 at 15:00
  • Is it assumed that the entire array can fit in memory? – goat Oct 28 '12 at 16:18

5 Answers


You can put all the words in a trie (or a radix tree) and then print it in DFS order, visiting the lexicographically smaller letters first at each level of the DFS.

This solution will be O(n·|S|), where |S| is the average string length.

Simple example:

Let the set of strings be [ac,ab,aca]:

The resulting trie will be:

         a
       /  \
      /    \
     b      c
     |     / \
     $    $   a
              |
              $

And a DFS which prefers lexicographically smaller characters: the DFS starts from a, goes to b and then to the end sign ($), and first prints ab; it then goes back to a and right to c, reaches the next $ sign, and prints ac; next it continues to a and its $ and prints aca, resulting in printing:

ab
ac
aca

As expected.
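For concreteness, here is a minimal Python sketch of this approach (an illustration, not code from the answer): a plain dict serves as each trie node, and the empty string is used as the end-of-word marker because it compares lexicographically before every real character, including the space.

    def trie_sort(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})  # walk/create the path for this word
            node[""] = None  # end-of-word marker; "" sorts before any character

        out = []
        def dfs(node, prefix):
            for ch in sorted(node):  # at most 28 keys per node, so this is O(1)
                if ch == "":
                    out.append(prefix)  # end marker reached: emit the word
                else:
                    dfs(node[ch], prefix + ch)

        dfs(root, "")
        return out

    print(trie_sort(["ac", "ab", "aca"]))  # ['ab', 'ac', 'aca']

Since each node has at most 28 children, the per-node sorted() call is constant-time, and the build plus traversal stays O(n·|S|) overall.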

amit
  • But a radix tree is one of the more complicated structures to implement, and the cost of storage management can easily swamp the supposed gains in `O` efficiency. – Hot Licks Oct 28 '12 at 17:12

The lower bound for any comparison-based sort is Ω(n log n). You can't have a sorting algorithm based on comparing elements to each other whose worst case runs faster than this limit.

Both merge sort and heap sort have a worst-case running time of O(n log n), while quicksort has a worst-case running time of O(n^2) but an average running time of O(n log n).

It is worth mentioning that although quicksort has a worst-case running time of O(n^2), it often beats other algorithms with O(n log n) running time (like heapsort) thanks to a small constant factor and its suitability for efficient execution on current machine architectures.

There are, however, linear sorting algorithms that can sort integers (and not only integers) in O(n) time on a non-comparative basis; examples are counting sort, bucket sort, and radix sort.

MSD radix sort can sort strings in lexicographic order of their digits (characters, in this case), working from left to right.

It first sorts all strings by the leftmost character using another linear sorting algorithm (say, bucket sort), then recursively sorts each resulting group by the second character from the left, and so on until the groups are sorted by the rightmost character. At that point the array is completely sorted.

This algorithm will have a running time of O(k·N), where N is the number of elements and k is the key length (the word length; here 10 <= k <= 100).
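A minimal sketch of this approach in Python, assuming equal-length words as in the question (the function name is illustrative; dict buckets stand in for the inner linear sort):

    def msd_radix_sort(words, pos=0):
        # A group is sorted once it is trivial or every position is consumed.
        if len(words) <= 1 or pos >= len(words[0]):
            return words
        buckets = {}  # one bucket per character seen at position `pos`
        for word in words:
            buckets.setdefault(word[pos], []).append(word)
        out = []
        for ch in sorted(buckets):  # at most 27 distinct characters
            out.extend(msd_radix_sort(buckets[ch], pos + 1))
        return out

    print(msd_radix_sort(["bad", "abc", "abd", "aaa"]))  # ['aaa', 'abc', 'abd', 'bad']

Each character of each word is examined a constant number of times, matching the O(k·N) bound above.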

Moustafa Alzantot

Well, I've read (and upvoted) the answers about radix sort and the radix trie, very informative.
But.
In the case of radix sort you need to make 91 passes over N elements, so it will be 91 * N. I'm not even talking about the additional space.
In the case of mergesort you have N * log N compares, and since log N = log 1000000 ≈ 20, you get 20 * N compares.

So which one is faster? :) Or maybe I'm mistaken somewhere?

Roman Pekar
    But mergesort requires reading the entire string in each iteration (worst case, unless you can provide a better analysis), and in radix sort, each compare is done on a single character in the string, so though you have more compare ops, each is significantly cheaper, since it does not need to read the entire string. [P.S. thanks for upvoting :) ] – amit Oct 28 '12 at 16:31
  • You're right, mergesort compares and radix just passes through. It's trivial, thank you for pointing that out. Mergesort will surely do a little better than reading the whole string in each comparison, but I don't think that will help it overtake radix sort. – Roman Pekar Oct 28 '12 at 16:34

The ASCII values of the characters can be computed, so essentially this is an integer sort. Comparison-based sorting routines will at best get you O(n lg n), e.g. merge sort (which needs additional space for two auxiliary arrays of size n/2), and O(n^2) at worst (insertion sort, quicksort), though those need little additional space. Either way they are asymptotically slower than a linear sorting algorithm. I recommend looking at CLRS (http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844), specifically the chapter on sorting in linear time. O(n) is probably the best you can do in this scenario. Also, this post might help: Sorting in linear time?

I'd check out radix sort. http://en.wikipedia.org/wiki/Radix_sort
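To make the "characters as integers" point concrete, here is a hedged sketch of an LSD radix sort over the equal-length words from the question, using one stable counting-sort pass per character position (the function name and the 128-slot count array are illustrative choices, not from the post):

    def lsd_radix_sort(words):
        if not words:
            return words
        for pos in range(len(words[0]) - 1, -1, -1):
            # Stable counting sort keyed on the character at `pos`.
            counts = [0] * 128  # covers the ASCII codes of space and a-z
            for word in words:
                counts[ord(word[pos])] += 1
            total = 0
            for i in range(128):  # exclusive prefix sums: starting slot per key
                counts[i], total = total, total + counts[i]
            out = [None] * len(words)
            for word in words:  # stable placement preserves earlier passes
                out[counts[ord(word[pos])]] = word
                counts[ord(word[pos])] += 1
            words = out
        return words

    print(lsd_radix_sort(["cab", "abc", "aab"]))  # ['aab', 'abc', 'cab']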

The Internet

Why not a distribution sort per three characters? That would need count storage of 19683 (27*27*27) elements, which should be feasible, and then at most 34 passes are needed.

But very soon the sublists per key (a group of three characters) will be short enough to use an insertion sort or the like on the remainder of the string: 1,000,000 / 27^3 is about 50 words per sublist.

The same mechanism can be used with longer keys if the words have long prefixes in common, i.e. when even the first 30 characters would divide the list into only 20 or 30 sublists. Then you don't represent the keys as numbers but as strings and store them in a dictionary, which is slower, but fewer passes are needed, and maybe less memory as well. Also, it would need around N*log(M) lookups, with M the number of different keys, in a binary tree, but hashing is also a possibility.
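A rough sketch of the first stage in Python (the base-27 key encoding and the names are illustrative; the built-in sort stands in for the insertion sort on each small bucket):

    def char_code(ch):
        # Map the 27-character alphabet to 0..26; space sorts before a-z.
        return 0 if ch == " " else ord(ch) - ord("a") + 1

    def distribution_sort(words):
        buckets = {}
        for word in words:
            # Base-27 number built from the first three characters.
            key = (char_code(word[0]) * 27 + char_code(word[1])) * 27 + char_code(word[2])
            buckets.setdefault(key, []).append(word)
        out = []
        for key in sorted(buckets):  # ~50 words per bucket on average for 10^6 words
            # All words in a bucket share their first three characters, so a
            # plain comparison sort finishes the job cheaply.
            out.extend(sorted(buckets[key]))
        return out

    print(distribution_sort(["banana pie", "apple tart", "apple pies"]))
    # ['apple pies', 'apple tart', 'banana pie']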

Jan Boonen