25

Problem

I have an application where I want to sort an array a of elements a_0, a_1, ..., a_(n-1). I have a comparison function cmp(i,j) that compares elements a_i and a_j and a swap function swap(i,j) that swaps elements a_i and a_j of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than all other steps in the sort combined (except for other cmp(i,j) calls, of course). You may think of cmp(i,j) as a rather lengthy IO operation.

Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.

Questions

  • Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

  • It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.

    Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparison operations? If yes, can you point me to such an algorithm?

  • I'd like pointers to further material on this topic.

Example

This is an example that is not entirely unlike the application I have.

Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.

Of course reading in large amounts of data is expensive, therefore one can, for instance, only read the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold: two large files could differ in only one byte near the end.

The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is necessary.
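
For illustration only, here is a minimal sketch of how cmp(i,j) and expensive(i,j) might look for this example. It assumes a global list `files` of paths and a 1 MB prefix; the helper names (`_prefix_hash`, `deep_compare`) are mine and not part of the question.

```
import hashlib
from functools import lru_cache

files = []           # paths of the files being sorted; filled in by the caller
PREFIX = 1 << 20     # the cheap path only looks at the first megabyte

@lru_cache(maxsize=None)
def _prefix_hash(path):
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read(PREFIX)).digest()

def expensive(i, j):
    # Equal prefix hashes mean cmp(i,j) cannot shortcut and must read both
    # files in full; different hashes mean the comparison stops early.
    return _prefix_hash(files[i]) == _prefix_hash(files[j])

def deep_compare(a, b):
    # Full comparison of both files (the expensive part).
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        da, db = fa.read(), fb.read()
    return (da > db) - (da < db)

def cmp(i, j):
    hi, hj = _prefix_hash(files[i]), _prefix_hash(files[j])
    if hi != hj:
        return -1 if hi < hj else 1   # cheap: order by prefix hash
    return deep_compare(files[i], files[j])
```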

fuz
  • Radix sort is the only one I know of that is non-comparative, but you have to be able to treat your objects as integers: http://en.wikipedia.org/wiki/Radix_sort – IllusiveBrian Aug 22 '13 at 13:11
  • I would guess 1) pick one of the standard low-complexity sorts 2) cache the result of all comparisons. You might be able to use the expensive predicate to pick the pivot element better in a quicksort, say, but I doubt it's a huge gain. – Rup Aug 22 '13 at 13:12
  • Is there any particular structure to the *expensive(i, j)* relation? – Pascal Cuoq Aug 22 '13 at 13:15
  • @Pascal Yes. *expensive(i,j) ∧ expensive(j,k) → expensive(i,k)* – fuz Aug 22 '13 at 13:18
  • I think as Rup says, if you've got the space available, cache the results. Also, of course, compute any and all transitive implied results rather than querying for them (if you've already had to query to determine that `i – Damien_The_Unbeliever Aug 22 '13 at 13:19
  • @PascalCuoq Keep in mind that I am also interested in solutions where this property about *expensive(i,j)* does not hold. (I'm not even sure it does). – fuz Aug 22 '13 at 13:20
  • @Namfuak Neither radix sort nor any non-comparison based sort is applicable to my problem. – fuz Aug 22 '13 at 13:34
  • Then you may be stuck, as any comparison based sorting easily takes omega(n) comparisons, and that is a very optimistic lower bound – jamesSampica Aug 22 '13 at 13:39
  • @Jim Yes. Would it be possible to use the information provided by the *expensive(i,j)* function to avoid choosing expensive comparisons wherever possible? – fuz Aug 22 '13 at 13:40
  • @Fuz No, because such comparisons still need to be made, omega(n) of them – jamesSampica Aug 22 '13 at 13:45
  • @Jim So, even if just one possible comparison is expensive there have to be done *Ω(n)* expensive comparisons? I kind of don't believe you... – fuz Aug 22 '13 at 13:56
  • @Fux I misread what _expensive()_ accomplished, let me get back to you – jamesSampica Aug 22 '13 at 14:02
  • @FUZxxl: For your example, you could probably start by using cheaper criteria, eg. sort by size first, then by a (memoized) fast hash, then a (memoized) full hash, or possibly multiple hashes (ie. hashing blocks), which would allow you to stop at the first mismatch. Also as Damien_The_Unbeliever says, caching the transitive results would likely save you extra comparisons. – Hasturkun Aug 22 '13 at 14:09
  • @Hasturkun I already do this. But regardless of what cheaper criteria you have, there will be a point where I have to do an expensive deep comparison, because a shortcut cannot catch all cases of differences. The question is about minimizing the need for such deep comparisons. – fuz Aug 22 '13 at 14:11
  • In the example with files, you may use a strong hash function (SHA512), sort your vector by hash, and then compare only those files with the same hash code. If the hash is 512 bit and it's strong, you should need at least 10^68 elements to reach a collision probability greater than 10^-18. Depending on the application, a failure rate of 10^-18 could be acceptable. – Giulio Franco Aug 22 '13 at 18:02
  • @Giulio This is a point if one assumes that no hash collisions are possible. Still, the question stands for cases where this is not applicable. – fuz Aug 22 '13 at 18:04
  • @FUZxxl I didn't assume they are not possible. They are possible. If they happen, the algorithm would not fail, but it would be slower. – Giulio Franco Aug 22 '13 at 18:06
  • @FUZxxl: No, collisions are certainly possible. You never have to do a deep comparison if you use a cryptographically-secure hash like SHA256 or SHA512, as someone mentioned. Realize this: if you can find a collision, **you will have broken the security of the algorithm**! The probability of getting a collision with these is so low that it's practically zero. – user541686 Aug 22 '13 at 18:13
  • @FUZxxl: You could also try a randomized approach if you only want to hash a sub-portion of the files -- generate some random 1-MB intervals of the files and *hash the data in those*, instead of always hashing the first 1-MB interval. That way no one can exploit your program's worst-case behavior deterministically. – user541686 Aug 22 '13 at 18:18
  • @FUZxxl anyway, I'm trying to say that a solution significantly better than mergesort may not exist for the general question you asked, while it may exist for **your** problem. – Giulio Franco Aug 22 '13 at 18:28
  • A note on the use of hash+size: SHA is a strong hash, which means it's impossible to obtain a collision by slightly modifying the input. So, if two distinct files have the same hash, they will be significantly different, meaning their comparison will be fast. Moreover, for a perfect hash, the probability that two files of the same size have the same hash is zero, unless the files are at least 2^512 bytes long. SHA may not be perfect, but it's very unlikely to obtain a collision with two same-sized files of reasonable sizes (<2^64 B) – Giulio Franco Aug 23 '13 at 08:32
  • @Giulio Yes, I know that for the example application hashing works. I'm still interested in solutions where this is not applicable. (See edit to the original question: *Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.*) – fuz Aug 23 '13 at 09:52

9 Answers

9

I'll try to answer each question as best as I can.

  • Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

Traditional sorting methods may have some variation, but in general, there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are often not inexpensive. You could try sorting by something else, or try using a faster shortcut that approximates the real comparison.

  • Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparison operations? If yes, can you point me to such an algorithm?

I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
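
As a sketch of "changing what you compare" (names like `cheap_key` and `deep_cmp` are mine; `deep_cmp` stands for the real, expensive comparison): sort by a cheap surrogate key first and fall back to the expensive comparison only inside groups the surrogate cannot tell apart.

```
import os
from itertools import groupby
from functools import cmp_to_key

def cheap_key(path):
    # A surrogate that is nearly free compared with a deep comparison.
    return os.path.getsize(path)

def sort_with_surrogate(paths, deep_cmp):
    by_key = sorted(paths, key=cheap_key)          # no deep comparisons here
    result = []
    for _, group in groupby(by_key, key=cheap_key):
        group = list(group)
        if len(group) > 1:                         # only ties reach deep_cmp
            group.sort(key=cmp_to_key(deep_cmp))
        result.extend(group)
    return result
```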

  • I'd like pointers to further material on this topic.

Check these out:

pattivacek
  • Oh yeah, that's a good idea. When I come back from my vacation I'm going to have a look at my copy of TAoCP vol. 3. – fuz Aug 22 '13 at 13:35
  • The example in your edited question is a great example of a shortcut or alternate comparison operation. The sorting algorithm itself, though, is still limited to a certain number of comparison operations, but if you can change what is compared, that would be key. – pattivacek Aug 22 '13 at 13:53
  • The point is, whatever I do, the shortcut cannot work in some cases (notably, when two large files are almost equal). I can of course assume that iff the first megabytes are equal, the two files are equal, but that is certainly a wrong assumption. – fuz Aug 22 '13 at 13:57
  • That's true, you will still probably have to do the full, real comparison in some circumstances. But if you compare filesize, md5sum, first and last n bytes or some other metric, you may be able to get close to never needing the full comparison. Which metric will do that best really depends on your specific case with your specific data, but we can try to come up with ideas. – pattivacek Aug 22 '13 at 15:06
  • @pattivacek This is true. However, this is not the point of this question. This question concerns how to minimize the instances where an expensive (aka deep) comparison is actually done. – fuz Aug 22 '13 at 15:07
  • Now that I read your answer again, I found out that you were a bit misled. *expensive(i,j)* is not a comparison function. It is a function that predicts whether a call to *cmp(i,j)* would be expensive. If a shortcut is possible, *cmp(i,j)* would take that shortcut. – fuz Aug 22 '13 at 15:33
  • Well, let me try to make sure we're on the same page. Mathematically, there must be a certain minimum number of comparisons. If we compare with something other than `cmp(i,j)` that is easier to compute, we can remove most calls to `cmp(i,j)`. We will still have to rely on `cmp(i,j)` for the cases where the simpler comparison method yields equivalence between two items. If that's the case, then that simpler comparison should be something that yields a usable result as often as possible. Whatever that ideal simpler comparison may be depends on your data. – pattivacek Aug 22 '13 at 15:42
  • Yes. This is the point I got to before I asked this question. Now back to the question: Can we tweak the sorting algorithm in a way, possibly using *expensive(i,j)* to predict the time needed to compare *ai* and *aj* that it tries to avoid doing "expensive" comparisons? For instance, if I ran a stock mergesort, it might compare *ai* and *aj* multiple times, even though I might know that such a comparison is expensive. – fuz Aug 22 '13 at 15:54
  • Most sorting algorithms will only compare the same two objects once. It would be redundant to repeat that comparison. If that is somehow a concern, you should just cache the result somewhere, as others have suggested. But you simply cannot reduce the number of comparisons below a certain mathematical limit. – pattivacek Aug 22 '13 at 16:08
  • I know that there is a lower boundary for the total comparisons needed. I don't fight that. That there is a lower boundary says nothing about which elements must be compared, just that the total number of comparisons will exceed a certain threshold. That there is a lower boundary does not exclude that there is a tricky way to almost exclusively do certain cheap comparisons while only rarely doing expensive ones. – fuz Aug 22 '13 at 16:21
8

The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg (n!), which is about n lg n - 1.44n. There's no way to do better than this on average if you're using comparisons to order the elements.
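
To put numbers on that bound, here is a quick check of lg(n!) against n lg n (not part of the original answer):

```
import math

def min_avg_comparisons(n):
    # log2(n!) via lgamma: the information-theoretic lower bound
    return math.lgamma(n + 1) / math.log(2)

for n in (10, 1000, 1000000):
    print(n, round(min_avg_comparisons(n)), round(n * math.log2(n)))
```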

Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.44 n lg n for quicksort and about n lg n + 2n for heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.

If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
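
A compact sketch of natural mergesort, not taken from the answer; `cmp(x, y)` here works on elements (returning negative/zero/positive) rather than on indices as in the question. Note that already-sorted input costs only n-1 comparisons to detect.

```
def natural_mergesort(a, cmp):
    if len(a) < 2:
        return list(a)
    # Split the input into maximal non-descending runs.
    runs, start = [], 0
    for i in range(1, len(a)):
        if cmp(a[i - 1], a[i]) > 0:
            runs.append(a[start:i])
            start = i
    runs.append(a[start:])
    # Merge runs pairwise until one run remains.
    while len(runs) > 1:
        merged = [merge(runs[k], runs[k + 1], cmp)
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0]

def merge(left, right, cmp):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if cmp(right[j], left[i]) < 0:   # take right only if strictly smaller (stable)
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out
```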

There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.

Hope this helps!

templatetypedef
  • On average? For comparison-based sorting, it's the best case that is O(n log n), not the average. – Saeed Amiri Aug 24 '13 at 09:41
  • @SaeedAmiri- Are you sure about that? Comparison sorting algorithms can have best-case runtimes of O(n). Think about insertion sort on a sorted array, for example. The n log n barrier just says the average case can't be better than Omega(n log n), and many comparison sorts have worst-case runtime O(n log n) with O(n) best case runtimes. Take, for example, smoothsort. – templatetypedef Aug 24 '13 at 18:59
  • I see I wrote my comment in the wrong direction, sorry for that (I meant to write that the worst case is Omega(n log n) and got it backwards!). By the way, my point was that the barrier is not about the average but about the running time; maybe the average is important, but the running time matters more, and the measure of running time is the worst case (e.g. we don't say insertion sort is O(n), we say it is O(n^2)). Where did you see that the barrier is about the average? The barrier is a pigeonhole-principle argument which says there exists something; it never talks about the average. – Saeed Amiri Aug 24 '13 at 23:12
  • @SaeedAmiri- The standard proof of the sorting lower bound only talks about the worst-case runtime, but it's possible to modify it to talk about the average-case runtime as well by looking at the average length of all paths from the root to any leaf node. You can use the fact that the tree has height log (n!) to show that the average path length has to be at least log (n!) as well. One proof of this is given in this link, for example: http://www.cs.cmu.edu/~avrim/451f11/lectures/lect0913.pdf . Hope this helps! – templatetypedef Aug 25 '13 at 07:23
  • You could take a look into `Weak Heapsort`, with a worst case number of comparisons of `n log n + 0.1n`. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.2819&rep=rep1&type=pdf for details. – MicSim Aug 28 '13 at 15:22
4

A technique called the Schwartzian transform can be used to reduce any sorting problem to that of sorting integers. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.


(Python-oriented answer, when I thought the question was tagged [python])

If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using

sorted(L, key=f)

Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.

Python 3 does not support specifying a cmp function, only the key parameter. The standard library's functools.cmp_to_key provides a way of easily converting any cmp function to a key function.
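
For completeness, a small example using that helper (`three_way_cmp` is a placeholder for the expensive comparison). Note that unlike a plain key, the cmp function is still invoked once per comparison, not once per element.

```
from functools import cmp_to_key

def three_way_cmp(x, y):
    # stands in for the expensive comparison; negative/zero/positive result
    return (x > y) - (x < y)

data = [3, 1, 2]
data.sort(key=cmp_to_key(three_way_cmp))
print(data)   # [1, 2, 3]
```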

chepner
  • Somehow, I thought that this was tagged 'python'. However, the Schwartzian transform can be applied in any language. – chepner Aug 22 '13 at 13:27
  • This is not really applicable here (see example), but still interesting to know. – fuz Aug 22 '13 at 13:54
  • You can use some advanced results from axiomatic set theory to prove that this isn't always possible. Some orderings can't be embedded inside integers or reals (for example, the lexicographical ordering of strings). – templatetypedef Aug 22 '13 at 16:55
  • Only if you have no bound on the length of the strings. – chepner Aug 22 '13 at 17:11
2

Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.

Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?

Not that I know of, but perhaps you'll find it in these papers below.

I'd like pointers to further material on this topic.

On Optimal and Efficient in Place Merging

Stable Minimum Storage Merging by Symmetric Comparisons

Optimal Stable Merging (this one seems to be O(n log² n), though)

Practical In-Place Mergesort

If you implement any of them, posting them here might be useful for others too! :)

user541686
  • This only tackles a minimal comparison sorting algorithm, not whether the condition `expensive(i,j) ∧ expensive(j,k) → expensive(i,k)` can be used to avoid expensive comparisons – jamesSampica Aug 22 '13 at 18:08
1

Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?

The merge insertion algorithm, described in D. Knuth's "The Art of Computer Programming", Vol 3, chapter 5.3.1, uses fewer comparisons than other comparison-based algorithms. But it still needs O(N log N) comparisons.

Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?

I think some existing sorting algorithms may be modified to take the expensive(i,j) predicate into account. Let's take the simplest of them - insertion sort. One of its variants, named in Wikipedia as binary insertion sort, uses only O(N log N) comparisons.

It employs a binary search to determine the correct location to insert new elements. We could apply the expensive(i,j) predicate after each binary search step to determine if it is cheap to compare the inserted element with the "middle" element found in that step. If it is expensive, we could try the "middle" element's neighbors, then their neighbors, etc. If no cheap comparison can be found, we just return to the "middle" element and perform the expensive comparison.

There are several possible optimizations. If the predicate and/or cheap comparisons are not so cheap, we could roll back to the "middle" element before all other possibilities are tried. Also, if move operations cannot be considered very cheap, we could use some order statistics data structure (like an indexable skiplist) to reduce insertion cost to O(N log N).

This modified insertion sort needs O(N log N) time for data movement, O(N²) predicate computations and cheap comparisons, and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
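
A sketch of this probe-selection idea as I read it (element-based `cmp` and `expensive`, and a simple outward neighbor search; none of the names come from the answer):

```
def pick_probe(out, lo, hi, x, expensive):
    # Prefer a probe near the middle of out[lo:hi] that is cheap to compare
    # with x; fall back to the middle if every candidate is expensive.
    mid = (lo + hi) // 2
    for offset in range(hi - lo):
        for m in (mid - offset, mid + offset):
            if lo <= m < hi and not expensive(x, out[m]):
                return m
    return mid

def insertion_sort_expensive_aware(a, cmp, expensive):
    out = []
    for x in a:
        lo, hi = 0, len(out)
        while lo < hi:
            m = pick_probe(out, lo, hi, x, expensive)
            if cmp(x, out[m]) < 0:   # x belongs before out[m]
                hi = m
            else:                    # x belongs after out[m]
                lo = m + 1
        out.insert(lo, x)
    return out
```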

Consider a set of possibly large files. In this application the goal is to find duplicate files among them.

If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on a hash value computed from the first megabyte of data of each file. If there is more than one file in some bucket, take another 10, 100, 1000, ... megabytes. If there is still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.
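
A sketch of that bucketing procedure (the chunk size, the growth factor and the SHA-256 choice are mine, and the final byte-by-byte check is elided in favor of the full-hash case):

```
import os
import hashlib
from collections import defaultdict

def duplicate_groups(paths, chunk=1 << 20):
    # Bucket files by a hash of their first `chunk` bytes, then refine
    # buckets that still hold several files using a 10x larger chunk.
    buckets = defaultdict(list)
    for p in paths:
        with open(p, 'rb') as f:
            buckets[hashlib.sha256(f.read(chunk)).digest()].append(p)
    groups = []
    for group in buckets.values():
        if len(group) < 2:
            continue
        if all(os.path.getsize(p) <= chunk for p in group):
            groups.append(group)   # hashed in full: treat as duplicates
        else:
            groups.extend(duplicate_groups(group, chunk * 10))
    return groups
```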

Evgeny Kluev
0

Quicksort and mergesort are the fastest possible sorting algorithms, unless you have some additional information about the elements you want to sort. They will need O(n log(n)) comparisons, where n is the size of your array. It has been mathematically proven that any generic sorting algorithm cannot be more efficient than that.

If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).

If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.

You can look for all the mentioned algorithms on wikipedia.

As far as I know, you can't benefit from the expensive relationship. Even if you know that, you still need to perform such comparisons. As I said, you'd better try and cache some results.


EDIT

I took some time to think about it, and I came up with a slightly customized solution, that I think will make the minimum possible amount of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
  • n is the size of the input vector
  • m is the number of distinct components which are easy to compare with each other
  • k is the maximum number of elements which are hard to compare and have consecutive ranks.

Here is the description of the algorithm. It's worth noting that it will perform much worse than a simple merge sort, unless m is large and k is small. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out of the asymptotic notation). That n^4 can probably be further reduced, at least in the mean case.

EDIT

The file I posted contained some errors. While trying it, I also fixed them (I overlooked the pseudocode for the insert_sorted function, but the idea was correct). I made a Java program that sorts a vector of integers, with delays added as you described. Even though I was skeptical, it actually does better than mergesort if the delay is significant (I used a 1s delay against integer comparison, which usually takes nanoseconds to execute).
Giulio Franco
  • _generic sorting algorithm cannot be more efficient than that._ Yes, for comparison based sorting. Someone has already mentioned Radix sorting, which is linear time. Though maybe that's what you meant by _generic_? – jamesSampica Aug 22 '13 at 13:26
  • I might be wrong, but it is O(n log(n)) on average, right? It can be worse, O(n^2), or better, O(n), for quicksort. There might be a way to optimize that depending on your needs. – dyesdyes Aug 22 '13 at 13:36
  • @dyesdyes you can't _optimize_ it to be better than nlog(n) with non-deterministic input. – jamesSampica Aug 22 '13 at 13:51
  • The argument "Quicksort is the fastest..." is not quite applicable. First, Quicksort has some pathologic cases with an execution time of *O(n²)* so it might not be the best one, and second I wouldn't care if the algorithm took *O(n³)* time if it made fewer expensive comparisons than a Quicksort. That's the point of the question. – fuz Aug 22 '13 at 13:52
  • @Jim exactly. I also mentioned radix/bucket sorting in my answer. – Giulio Franco Aug 22 '13 at 14:41
  • @FUZxxl I think it can be proved that the pathologic cases are rare, if the input array is random, and that they can be further avoided with random pivot selection. Anyway, I also said that I never heard of sorting that can benefit from the `expensive` function you mentioned, because it gives you no information at all about the relative ordering of two elements. In the worst case, all comparisons could be expensive, and it would take O(n^2) only to realize that (maybe less, since expensiveness is transitive) – Giulio Franco Aug 22 '13 at 14:45
  • For big values of n, O(n log n) will be better than O(n^3), no matter how complex your comparison is (as long as it's not a function of n). As I said, you need more certainties to draw some conclusions (such as "the array will never be bigger than K") – Giulio Franco Aug 22 '13 at 14:48
  • @Giulio Franco You may assume, as I stated in the question, that the time needed to perform *one* computation dwarfs the entire rest of the sorting time for the amount of data I have. Asymptotic times are great, but sometimes O(1) is slower than O(log n) for every reasonable set of data if that O(1) is very large. – fuz Aug 22 '13 at 15:57
  • @FUZxxl By definition of asymptotic time, there exists a constant C such that, for any n > C, the O(1) algorithm will be faster than the O(log n) algorithm. However, you answered me by stating that for any reasonable dataset, this will not happen. So, you're back to the beginning: you need to simplify your comparison. If the comparison time is so big, you may consider performing a O(n) preprocessing, to map each element to a set of isomorphic integers. This might still be expensive, but a sequential sorting algorithm that performs less than O(n) comparisons cannot exist at all. – Giulio Franco Aug 22 '13 at 16:10
  • @Giulio Said C might be so large that it doesn't matter for the sake of the actual problem. Exactly this is the case here. – fuz Aug 22 '13 at 16:13
  • Quicksort and mergesort are **not** optimal comparison sorting algorithms. They're asymptotically optimal, but that doesn't mean they provably minimize the number of comparisons made. – templatetypedef Aug 22 '13 at 16:52
0

Most sorting algorithms out there try to minimize the number of comparisons made during sorting.

My advice: pick quicksort as a base algorithm and memoize the results of comparisons in case you happen to compare the same elements again. This should help you in the O(N^2) worst case of quicksort. Bear in mind that this will make you use O(N^2) memory.
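
A sketch of that memoization wrapper (shown around Python's built-in sort rather than a hand-written quicksort; it assumes comparisons are made by index and never change):

```
from functools import cmp_to_key

def memoized(cmp):
    cache = {}
    def wrapper(i, j):
        if (i, j) not in cache:
            r = cmp(i, j)
            cache[(i, j)] = r
            cache[(j, i)] = -r   # antisymmetry: record the mirrored result too
        return cache[(i, j)]
    return wrapper

# usage: order = sorted(range(len(a)), key=cmp_to_key(memoized(cmp)))
```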

Now if you are really adventurous you could try the Dual-Pivot quick-sort.

le-doude
0

Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.

Hderms
0

We can look at your problem from another direction: your problem seems to be IO related, so you can take advantage of parallel sorting algorithms. In fact, you can run many threads to perform the comparisons on files concurrently, then sort them with one of the best known parallel algorithms, like the sample sort algorithm.
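
A sketch of the parallel-IO idea applied to the cheap part of the comparison (the pool size and the prefix-hashing step are illustrative, not from the answer):

```
from concurrent.futures import ThreadPoolExecutor
import hashlib

def prefetch_prefix_hashes(paths, workers=8, chunk=1 << 20):
    # Hash the first `chunk` bytes of every file in parallel, so the sort
    # itself never waits for this part of the IO.
    def prefix_hash(p):
        with open(p, 'rb') as f:
            return p, hashlib.sha256(f.read(chunk)).digest()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(prefix_hash, paths))
```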

Saeed Amiri
  • Cool. So you mean my HDD will spin faster when I try to access two files at once? Please notice what I wrote in the question: *Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.* – fuz Aug 23 '13 at 10:33
  • @FUZxxl, Cool man, I didn't think that you can do it faster on your HDD; where did you see this in my answer? It's clear that there are a few sorting algorithms with fewer possible comparisons, but loading data from IO into memory is much more time consuming than comparing two big pieces of data in memory. So the best thing you can do is parallelize fetching the data. I did not try to improve your comparison method at all, and I explicitly mentioned that "Seems your problem is IO related". I'm pretty sure that you didn't try parallelization, otherwise you were not going to be such a cool man. – Saeed Amiri Aug 23 '13 at 11:42
  • If the amount of data was actually small enough to fit into the machine's RAM, I wouldn't have these performance problems anyway. As I wrote above, the time taken to actually *sort* is negligible. The cases where *cmp(i,j)* is expensive are exactly those where I actually have to read lots of data from the hard disk (which, as stated in the question, I cannot optimize away). Now, tell me how parallelization is exactly gonna help me. – fuz Aug 23 '13 at 13:52
  • @FUZxxl, Right now you are doing serial actions: you read the data, for example by bulk read or something, then you compare it, then you read the next part and so on. But the point is that while you process data in the CPU, you could also be fetching other data. – Saeed Amiri Aug 23 '13 at 14:11
  • Processing the data takes maybe 0.1% of the time. Reading other data in the meanwhile wouldn't bring very much. – fuz Aug 23 '13 at 14:35
  • @FUZxxl, So this improves the running time by 0.1%, and I don't think you can improve it by more than 0.1% if you change, for example, in-place merge sort to some sort with fewer comparisons (which is quite a hard job, if there is any). All you should do is think about the way of fetching, or possibly creative randomized sort algorithms (combined with parallelization) such that you don't need to fetch the whole file, but sort it with high probability in good form. – Saeed Amiri Aug 23 '13 at 14:39