
I have a large file of strings that I need sorted and I'm wondering if someone can tell me which sort would be most efficient? I read about Quicksort and apparently it's more for primitive types, whereas Mergesort would work with Strings. Oh, and there's insertion sort, but for small arrays. I'm not really aware of other sorting algorithms.

1. What is the benefit/downside to recursion?

2. What is the benefit/downside of iteration?

I only have 4GB of memory to work with, so obviously I cannot load my file into memory.

Thanks.

dimo414
user817129
    http://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files – skozlov Jul 16 '15 at 04:44
  • You'll be hard pressed to do better than [`sort`](https://www.google.com/search?q=man%20sort) - just use that. – dimo414 Jul 16 '15 at 04:52
  • Search Tera Sort Algorithm – Ankur Anand Jul 16 '15 at 05:04
  • With the amount of data _that_ close to available memory, looking for a more compact representation of that data might turn this from a problem calling for an "external sort" into one for your favourite internal sort method: with strings, this is more probable to work for largish ones than for strings of just a few characters: tell us more about your `String`s. – greybeard Jul 16 '15 at 05:15

3 Answers


As 11GB into 4GB doesn't go, you will have to use an external sort. This will consist of a replacement-selection distribution phase and a polyphase or cascade or balanced merge phase. Nothing to do with either Quicksort or merge-sort in either case. If you have sort(1) available, use that. Or a database.
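For anyone unfamiliar with the terms, replacement selection can be sketched with a priority queue. This is a toy in-memory version (the class name, list-based I/O, and capacity parameter `k` are mine, purely for illustration); a real implementation would stream records from and to disk:

```java
import java.util.*;

// Toy replacement selection: given input and room for k records "in memory",
// emit sorted runs. On random input the runs average about 2k records, i.e.
// roughly twice what a load-sort-write (Quicksort) pass would produce.
class ReplacementSelection {
    static List<List<String>> runs(List<String> input, int k) {
        List<List<String>> result = new ArrayList<>();
        PriorityQueue<String> current = new PriorityQueue<>(); // eligible for this run
        List<String> held = new ArrayList<>();                 // must wait for the next run
        Iterator<String> in = input.iterator();
        while (in.hasNext() && current.size() < k) current.add(in.next());
        List<String> run = new ArrayList<>();
        while (!current.isEmpty()) {
            String smallest = current.poll();
            run.add(smallest);
            if (in.hasNext()) {
                String next = in.next();
                if (next.compareTo(smallest) >= 0) current.add(next); // still fits this run
                else held.add(next);                                  // smaller than output so far
            }
            if (current.isEmpty()) {   // run finished: start the next one from the held records
                result.add(run);
                run = new ArrayList<>();
                current.addAll(held);
                held.clear();
            }
        }
        if (!run.isEmpty()) result.add(run);
        return result;
    }
}
```

Runs longer than the memory capacity `k` are the whole point: fewer, longer runs mean fewer merge passes over the data.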

user207421
    Could you explain what those phases do? – marstran Jul 16 '15 at 05:28
  • @marstran The distribution phase creates multiple short sorted files from the input. The merge phase merges them together repeatedly into longer and longer files until only one is left. Most of the time is spent merging, not distributing, so it is customary to use replacement selection for the distribution phase rather than a Quicksort: Quicksort produces runs only about half as long on average, which has a far more catastrophic effect on merge time than any time it saves in sorting. – user207421 Dec 31 '17 at 04:41

What you want is an external sort.

Example (From Wikipedia):

For sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data into main memory and sort it by some conventional method, like quicksort.

2. Write the sorted data to disk.

3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (900 MB / 100 MB = 9 chunks), which now need to be merged into one single output file.

4. Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)

5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally: because the merge algorithm makes only one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
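The 9-way merge in the last step is just a priority queue keyed on each chunk's current head element. A minimal sketch (class and method names are mine; the in-memory iterators stand in for buffered readers over the on-disk chunks):

```java
import java.util.*;

// Toy k-way merge, the core of the external merge phase. Each sorted run is an
// Iterator<String> here; on disk it would be a BufferedReader over a run file,
// refilled one buffer at a time as the merge consumes it.
class KWayMerge {
    static List<String> merge(List<Iterator<String>> runs) {
        // Heap entry = {current head element, index of the run it came from}.
        PriorityQueue<Object[]> heap =
            new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
        for (int i = 0; i < runs.size(); i++)
            if (runs.get(i).hasNext()) heap.add(new Object[]{runs.get(i).next(), i});
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] top = heap.poll();
            out.add((String) top[0]);                 // smallest remaining element overall
            Iterator<String> run = runs.get((Integer) top[1]);
            if (run.hasNext()) heap.add(new Object[]{run.next(), top[1]}); // refill from same run
        }
        return out;
    }
}
```

Because each run is consumed strictly front to back, only one element (or one buffer) per run needs to be in memory at a time.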

Karthik
    That is a very poor quality Wikipedia article. Quicksort has no place in external sorting. It doesn't even mention replacement-selection distribution, or dummy runs, let alone polyphase or cascade or balanced merging. Always remember [Wikipedia is not a reliable resource](https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_is_not_a_reliable_source) – user207421 Jul 16 '15 at 05:13
  • With RAM on the order of one fifth of the data to sort, forming initial runs using a priority queue can be expected to yield three runs - a 9-way or polyphase merge seems oversized. – greybeard Jul 16 '15 at 05:18
  • @EJP Yes, I agree that Wikipedia is not always reliable. But in this particular article, when you sort a chunk of 100MB I personally think it will not make any difference whether you use merge sort or quick sort. The important point is to do an n-way merge, and I think that is mentioned properly. – Karthik Jul 16 '15 at 05:20
  • @greybeard Exactly so. You would end up with a single three-way merge. – user207421 Jul 16 '15 at 05:22
  • @greybeard I think it's just written to give an idea of how it works; personally I even think that after sorting the individual chunks, using a heap over the whole data instead of doing an n-way merge is better. – Karthik Jul 16 '15 at 05:22
  • 1
    @karthik I fixed Wikipedia. Very strange considering it already cites Knuth. Did the author actually read it? – user207421 Jul 16 '15 at 05:22
  • @Karthik You are welcome to think whatever you like about whether it will make any difference, but the acid test is performance. Commercial sort-merge packages have always used replacement-selection distribution phases, and aren't written for a single special case of 100MB. – user207421 Dec 31 '17 at 04:44
  • 1
    As the mentioned Wiki article still mentions that each run (block which fits in RAM) may be ordered using any sorting algorithm eg quicksort, I infer that @user207421 's criticism was unfounded. BTW I don't even know what "ACID test is performance" should mean; ACID is not about performance, but reliability. – Alan Evangelista Nov 15 '19 at 09:07

The Arrays.sort() methods use a quicksort for all primitive-type arrays. Collections.sort() uses a merge sort; that sort is also used by Arrays.sort(Object[]) and Arrays.sort(T[], Comparator).
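Whatever algorithm they pick, both calls sort entirely in memory, which is exactly what the asker cannot do. A quick illustration (class and method names are mine; note that current JDKs actually use TimSort, a merge-sort variant, for object sorts):

```java
import java.util.*;

class JdkSortDemo {
    // Arrays.sort(Object[]) sorts in place with a stable merge-sort variant
    // (TimSort in current JDKs); the whole array must fit in memory.
    static String[] sortWords(String[] words) {
        Arrays.sort(words);      // natural lexicographic order
        return words;
    }

    // Collections.sort delegates to the same object sort.
    static List<String> sortList(List<String> words) {
        Collections.sort(words);
        return words;
    }
}
```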

user3181531
  • OP needs to sort 11GB with 4GB RAM. `Arrays.sort()` and `Collections.sort()` will not work. – dimo414 Jul 16 '15 at 04:57
  • 1
    Have you read the question? Besides, it's just false. It depends on array size, but currently TimSort is the preferred algorithm. – Tagir Valeev Jul 16 '15 at 04:58
  • @TagirValeev [Java uses TimSort](https://en.wikipedia.org/wiki/Timsort) for non-primitive sorts. Even if it didn't, TimSort only provides a marginal speed increase over merge/quick sort; it doesn't use less space, nor is it asymptotically faster. – dimo414 Jul 16 '15 at 05:00