
I have n strings, each of length n. I wish to sort them in ascending order.

The best algorithm I can think of takes O(n^2 log n) time: quicksort, where comparing two strings takes O(n). The challenge is to do it in O(n^2) time. How can I do it?

Also, radix sort methods are not permitted, as you do not know the number of letters in the alphabet beforehand.

piedpiper
  • No limit stated as such, so I think we can assume 10^4 or larger – piedpiper Jun 27 '14 at 02:57
    Well, you can go through the N^2 letters in the strings to count the number of letters in the alphabet (which takes only O(N^2) time), and then use radix sort... – T.C. Jun 27 '14 at 02:59
    We can think of it as Unicode, 65536 characters – piedpiper Jun 27 '14 at 03:01
  • @T.C. How do you intend to count the numbers in O(n^2) time? – piedpiper Jun 27 '14 at 03:03
  • @ashu http://stackoverflow.com/questions/13865094/how-to-count-distinct-values-in-a-list-in-linear-time – T.C. Jun 27 '14 at 03:04
  • @T.C. Thanks for that. This solution though, is still not O(n^2) in the worst case. Any insights on that? – piedpiper Jun 27 '14 at 03:06
  • Would a trie work? Insert all items in O(n^2), then extract sorted in O(n log n). – Ry- Jun 27 '14 at 03:09
  • I can't think of Unicode as 65536 characters. It's a very odd number; The BMP has fewer characters but the whole character set has far more; Codepoints are not in contiguous blocks; Surrogates are required to be paired;...??? – Tom Blodget Jun 27 '14 at 03:35
  • Radix sort *is* applicable here -- all you need is a way to represent each character in a bounded number of bits, and a way to get from any character to the next (or previous) one in a string in O(1) -- though it's likely to be slower than a regular comparison sort when the bit-width of the characters greatly exceeds log(n) as seems likely here. – j_random_hacker Jun 27 '14 at 06:30

4 Answers


Assume every letter is in the range 'a' to 'z'.

Since there is no requirement for in-place sorting, create an array of 26 linked lists:

List[] sorted = new List[26]; // each element is a list you can append to

For each letter x in a string, its bucket index is the ASCII difference x - 'a'. For example, 'c' maps to index 2 and is placed with

sorted[2].add('c')

That way, sorting the characters of one string takes O(n), so sorting all n strings takes O(n^2).

For example, take the string "zdcbacdca":

'z' goes to sorted['z'-'a'].add('z'),
'd' goes to sorted['d'-'a'].add('d'),
...

After sorting, the buckets look like this:

index: 0   1   2   3   ...  25
       a   b   c   d   ...  z
       a       c   d
               c

Note: the assumed alphabet determines the length of the sorted array.
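The per-string bucketing described above can be sketched as follows; the class and method names are mine, and a plain count array stands in for the linked lists (the letters 'a'..'z' assumption carries over):

```java
// Sketch of the per-string counting sort described above.
// Assumes every character is in 'a'..'z'; names are illustrative.
public class BucketSortChars {
    static String sortChars(String s) {
        int[] count = new int[26];            // one bucket per letter
        for (char c : s.toCharArray()) {
            count[c - 'a']++;                 // O(1) per character
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < 26; i++) {        // emit buckets in order
            for (int j = 0; j < count[i]; j++) {
                out.append((char) ('a' + i));
            }
        }
        return out.toString();                // O(n + 26) per string
    }

    public static void main(String[] args) {
        System.out.println(sortChars("zdcbacdca")); // prints "aabcccddz"
    }
}
```

Note that this sorts the characters inside one string, which is what the answer describes; it does not by itself order the list of n strings.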

Danyu

For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.

One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.

Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.

Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.

The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information, and we may need a full O(n) character comparisons to determine the outcome. But after that, things speed up.

During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].

The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
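A comparison that resumes at min(L, M) could be sketched like this (the helper name and the trick of returning the new LCP through a one-element array are mine, not part of the answer):

```java
// Compare strings a and b, assuming their first `skip` characters are
// already known to be equal; lcpOut[0] receives the length of their
// longest common prefix. Names here are illustrative.
public class LcpCompare {
    static int compareFrom(String a, String b, int skip, int[] lcpOut) {
        int i = skip;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) {
            i++;                                 // only characters past `skip` are examined
        }
        lcpOut[0] = i;                           // LCP of a and b
        if (i == a.length() || i == b.length()) {
            return a.length() - b.length();      // one string is a prefix of the other
        }
        return a.charAt(i) - b.charAt(i);
    }

    public static void main(String[] args) {
        int[] lcp = new int[1];
        // The first `skip` characters must genuinely match for this to be valid.
        System.out.println(compareFrom("apple", "apply", 3, lcp) < 0); // true
        System.out.println(lcp[0]);                                   // 4
    }
}
```

In the merge loop, `skip` would be min(L, M) as described above, so the characters before that index are never re-read.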

After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.

In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)

I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.

j_random_hacker

You can build a trie; inserting all the strings costs O(s*n), where s is the string length and n is the number of strings.

Details: https://stackoverflow.com/a/13109908
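A minimal sketch of that approach (names are mine, and each node keeps its children in a TreeMap so that a depth-first walk emits the strings in sorted order — note the TreeMap adds a log-of-alphabet factor per character):

```java
import java.util.*;

// Minimal trie sketch: insert every string, then walk children in
// character order to recover the sorted sequence. Names illustrative.
public class TrieSort {
    static class Node {
        TreeMap<Character, Node> next = new TreeMap<>(); // ordered children
        int count = 0;                                   // strings ending here
    }

    static List<String> sort(List<String> strings) {
        Node root = new Node();
        for (String s : strings) {               // insert: O(total characters)
            Node cur = root;
            for (char c : s.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.count++;                          // handles duplicates
        }
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    static void collect(Node node, StringBuilder prefix, List<String> out) {
        for (int i = 0; i < node.count; i++) out.add(prefix.toString());
        for (Map.Entry<Character, Node> e : node.next.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);   // visit children in order
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("ba", "ab", "aa", "ba")));
        // prints [aa, ab, ba, ba]
    }
}
```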

Sazzad Hissain Khan

Solving it for all cases should not be possible in better than O(N^2 log N). However, if there are constraints that relax the string comparison, it can be optimised.

- If the strings have a high repetition rate and come from a finite ordered set, you can borrow the idea of counting sort and use a map to store their counts; sorting just the map keys then suffices: O(N*M*log M), where M is the number of unique strings. You can even use a TreeMap directly for this purpose.

- If the strings are not arbitrary but are the suffixes of some superstring, this can be done in O(N log^2 N): http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
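The TreeMap idea from the first point might be sketched like this (class and method names are mine; each insert costs one string comparison times a log-of-distinct-keys factor, and duplicates are counted rather than re-stored):

```java
import java.util.*;

// Sketch of the count-based idea above: a TreeMap counts duplicates
// and keeps the distinct strings in sorted order. Names illustrative.
public class CountSortStrings {
    static List<String> sort(List<String> strings) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String s : strings) {
            counts.merge(s, 1, Integer::sum);     // count each duplicate
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(e.getKey());              // emit keys in sorted order
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("b", "a", "b", "a", "a")));
        // prints [a, a, a, b, b]
    }
}
```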

Akshay Sharma