
I have a very big sequence of strings. The length of each string is 50, and each string contains only characters from the English alphabet. What is the best (i.e. fastest) way to sort this sequence?

Siarhei Fedartsou
  • Is the built-in [Sort](http://msdn.microsoft.com/en-us/library/234b841s.aspx) method not fast enough? – dtb Mar 06 '11 at 10:57
  • No, if the size of the sequence is 500 GB it is not enough :) I've heard that there is a special algorithm for this case – Siarhei Fedartsou Mar 06 '11 at 10:59
  • Mmh... I think you're referring to [Radix-sort](http://en.wikipedia.org/wiki/Radix_sort), not sure if it's applicable here and you have to implement it from scratch though... – digEmAll Mar 06 '11 at 11:02
  • I'm not sure how to exploit the Length restriction here. What about a [parallel quick sort](http://stackoverflow.com/questions/1897458)? – dtb Mar 06 '11 at 11:03
  • What are you going to do with the data? For certain cases you could use a prefix tree/trie. – flight Mar 06 '11 at 11:04
  • How long are they, and is their content uniformly distributed? – CodesInChaos Mar 06 '11 at 11:04
  • @quasiverse I'm going to sort this data. It is my homework :) @CodeInChaos No, it is a random sequence – Siarhei Fedartsou Mar 06 '11 at 11:10
  • AnsiStrings or WideStrings? Under IA32 or IA64 CPU? – GJ. Mar 06 '11 at 11:19
  • Just to be sure, you have only `[ABC]` chars, or `[a-zA-Z]` ? – digEmAll Mar 06 '11 at 11:25
  • Oh, sorry, I meant 500 MB :) I have only [ABC]. – Siarhei Fedartsou Mar 06 '11 at 11:38
  • For 500 MB (10 million strings) the built-in Sort method should be good enough. – dtb Mar 06 '11 at 11:45
  • @dtb That is, if you have 500 MB of RAM :-) :-) – xanatos Mar 06 '11 at 11:55
  • I made a simple benchmark: I created in RAM 250 million characters' worth of 50-char random C# strings (so half the OP's target). It took 670 MB of memory (a C# char is 2 bytes, plus the overhead of each string, plus the overhead of the list) and 12 seconds on my slow laptop. Sorting them in memory (ordinal sorting) took another 12 seconds. The sorting isn't the problem :-) – xanatos Mar 06 '11 at 12:10

5 Answers


If I had to code that, I'd probably make one pass that splits the input into many output files depending on the first couple of characters or so, the goal being to make each output file small enough to fit in main memory. Then I would open each file in order, sort it in memory, and append it to the output. The first pass is O(n), the second is more or less O(n log n), and you have to do disk I/O four times per record. It might be possible to do better with some arcane algorithm, but probably not by much, and this is easy to understand and code.

If the system limits how many files you can have open at once, you might have to split up the first pass. If the strings aren't well-distributed, some intermediate files might be too large.

In pseudocode:

open input file (r)
for i in ['aa', 'ab', 'ac', ..., 'zz']:
    open output file[i] (w)
for record in input file:
    write record to output file[record[0:2]]
close all files
open main output file (w)
for i in ['aa', 'ab', 'ac', ..., 'zz']:
    open input file[i] (r)
    slurp whole file into memory
    close input file
    sort data
    append whole sorted file to main output file
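
For concreteness, here is a minimal C# sketch of the same two-pass idea. The file names and the two-character split are illustrative assumptions, and error handling is omitted:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Pass 1: distribute records into bucket files keyed on their first two characters.
var writers = new Dictionary<string, StreamWriter>();
foreach (string line in File.ReadLines("input.txt"))
{
    string key = line.Substring(0, 2);
    if (!writers.TryGetValue(key, out var w))
        writers[key] = w = new StreamWriter("bucket_" + key + ".txt");
    w.WriteLine(line);
}
foreach (StreamWriter w in writers.Values) w.Dispose();

// Pass 2: sort each bucket in memory and append it to the final output.
using (var output = new StreamWriter("sorted.txt"))
{
    foreach (string key in writers.Keys.OrderBy(k => k, StringComparer.Ordinal))
    {
        string[] lines = File.ReadAllLines("bucket_" + key + ".txt");
        Array.Sort(lines, StringComparer.Ordinal);
        foreach (string line in lines) output.WriteLine(line);
    }
}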

EDIT: Wait, do you mean the records only contain the characters A, B, and C? No other letters? In that case you would have to split on an initial substring longer than 2, since the first 2 characters give only 9 buckets. For 500 GB, even the first 3 characters give only 27 files, around 18.5 GB each on average; the first 7 characters give 3^7 = 2187 files of roughly 230 MB each.

Vamana
  • Note: above solution is for the 500-GB problem. 500 MB is easy, one pass in RAM for any modern machine. – Vamana Mar 06 '11 at 12:32

Something like this?

List<string> list = new List<string>();
/* fill the list */
list.Sort();

The Sort() method has different overloads that allow you to customize the way the string comparison is performed.
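
For example, a one-line sketch: passing an ordinal comparer skips culture-aware collation, which is the right (and fastest) choice for plain A-Z data:

list.Sort(StringComparer.Ordinal);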

EDIT: Oh, if by "big" you mean 500 GB worth of strings, then this probably isn't going to cut it.

kprobst

The algorithm you are looking for is probably the Merge Sort

http://en.wikipedia.org/wiki/Merge_sort

and this

http://en.wikipedia.org/wiki/External_sorting

BUT in your specific case, read this:

Need a way to sort a 100 GB log file by date

It could work for you!
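
To illustrate the external-sorting idea: once the input has been split into sorted runs that each fit in memory, the runs are merged in a single sequential pass. Below is a naive sketch of that merge phase (the run file names are hypothetical, and for many runs a heap would replace the linear scan):

using System.IO;
using System.Linq;

string[] runFiles = { "run0.txt", "run1.txt", "run2.txt" }; // pre-sorted run files
StreamReader[] readers = runFiles.Select(f => new StreamReader(f)).ToArray();
string[] heads = readers.Select(r => r.ReadLine()).ToArray(); // current front line of each run

using (var output = new StreamWriter("sorted.txt"))
{
    while (true)
    {
        // Pick the run whose current head is smallest in ordinal order.
        int min = -1;
        for (int i = 0; i < heads.Length; i++)
            if (heads[i] != null &&
                (min < 0 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                min = i;
        if (min < 0) break; // every run is exhausted
        output.WriteLine(heads[min]);
        heads[min] = readers[min].ReadLine(); // advance the winning run
    }
}
foreach (StreamReader r in readers) r.Dispose();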

xanatos
  • I've already used external sorting, but my sorting is very slow (I use Array.Sort). – Siarhei Fedartsou Mar 06 '11 at 11:39
  • @miksayer What do you mean by "very slow"? Reading and writing 500 GB of data is inherently slow, you know? Platter-based disks manage around 100 MB/s for sequential reads/writes (I'm using this data http://www.maximumpc.com/article/reviews/western_digital_caviar_black_2tb). For a merge sort you'll need 2 reads and 2 writes, so 2 TB between reads and writes. Mmmmh.. 350 minutes? And I hope you aren't using a USB drive for the data :-) – xanatos Mar 06 '11 at 11:48
  • @miksayer As an addendum, you can speed up an external merge sort (nearly doubling its speed) if you have two disks and one of them is used only for the temporary files. Technically, to the 350 minutes you'll need to add the in-memory sorting time, but I don't think sorting 500 MB of strings takes very long if you do it correctly (using StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase). – xanatos Mar 06 '11 at 11:54

Since 500 MB is not a lot of data, you can simply load the entire file into memory, sort it, and write the result back to disk.

I assume that the file contents are laid out like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n
    :
    :
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX\r\n

Code:

// Load
var data = File.ReadAllBytes("file.txt");
var itemCount = data.Length / 52; // each record is 50 chars + "\r\n" = 52 bytes
var indexes = new int[itemCount];
for (int i = 0; i < itemCount; i++)
{
    indexes[i] = i;
}

// Sort
Array.Sort<int>(indexes, (x, y) =>
{
    for (int i = 0; i < 50; i++)
    {
        if (data[x * 52 + i] > data[y * 52 + i]) return 1;
        if (data[x * 52 + i] < data[y * 52 + i]) return -1;
    }
    return 0;
});

// Store
using (var stream = File.Create("result.txt")) // Stream is abstract; File.Create returns a writable FileStream
{
    for (int i = 0; i < itemCount; i++)
    {
        stream.Write(data, indexes[i] * 52, 52);
    }
}
dtb
  • It's funny that, while this is probably the correct (and fastest) "response" to a (the/this) "homework" problem, it's so far from a "real world" response as to be, as I said, funny. A perfect example of the abstractness of the problem, probably :-) – xanatos Mar 06 '11 at 12:25

Quicksort (if used correctly) can be very effective at sorting strings.

The trick is to modify the partition method. The main idea is that at every partition step, the keys in a particular partition share the same prefix; when you partition again, you do not need to compare that prefix for the keys.

Example: Let's say the input is {"hello", "world", "house", "homly"} and the first partition is around the key "world".

You get: {"hello", "house", "homly"}, {"world"}

When you want to repartition the first set, you do not have to compare the first character of the strings, as you already know that the first character is the same in all of them!

In general, the length of the common prefix in a set would be the number of times partitioning was done to get to the set.

If you are interested in diving deeper, read Fast Algorithms for Sorting and Searching Strings by Bentley and Sedgewick.
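
For illustration, a minimal sketch of this idea, three-way string quicksort (multi-key quicksort) as described in that paper: the array is partitioned on the character at depth d, and only the middle ("equal") partition advances to depth d + 1, so a shared prefix is compared exactly once.

// Sort a[lo..hi], comparing strings starting at character position d.
// Call as: Sort3Way(array, 0, array.Length - 1, 0);
static void Sort3Way(string[] a, int lo, int hi, int d)
{
    if (hi <= lo) return;
    int lt = lo, gt = hi, i = lo + 1;
    int pivot = CharAt(a[lo], d);
    while (i <= gt)
    {
        int c = CharAt(a[i], d);
        if (c < pivot) Swap(a, lt++, i++);
        else if (c > pivot) Swap(a, i, gt--);
        else i++;
    }
    Sort3Way(a, lo, lt - 1, d);                 // smaller char at position d
    if (pivot >= 0) Sort3Way(a, lt, gt, d + 1); // equal char: skip the shared prefix
    Sort3Way(a, gt + 1, hi, d);                 // larger char at position d
}

// -1 is a sentinel for "past the end", so shorter strings sort first.
static int CharAt(string s, int d) => d < s.Length ? s[d] : -1;

static void Swap(string[] a, int i, int j) { string t = a[i]; a[i] = a[j]; a[j] = t; }

For fixed 50-character strings the recursion in d is bounded at depth 50, and since the alphabet here is tiny ([ABC]), the "equal" partitions are large, which is exactly where this variant beats a plain comparison-based quicksort.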