I'm looking to sort lists of 1 billion to 100 billion elements on systems with 8-128 cores, RAM for 10% of the elements, and disks delivering 100-1000 MBytes/s.
I have tested a simple merge sort, where each merge is performed in parallel by a CPU:
sorted_part_a:__
                \__[CPU.1]__
sorted_part_b:__/           \
                             \__[CPU.5]__
sorted_part_c:__             /           \
                \__[CPU.2]__/             \
sorted_part_d:__/                          \
                                            \__[CPU.7]
sorted_part_e:__                            /
                \__[CPU.3]__               /
sorted_part_f:__/           \             /
                             \__[CPU.6]__/
sorted_part_g:__             /
                \__[CPU.4]__/
sorted_part_h:__/
But that has the problem that the final merge step [CPU.7] has to do n comparisons on a single core when merging the last two inputs, and comparisons can be expensive (think strings that have to respect locale settings). In my test, [CPU.7] was the bottleneck.
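Below is a minimal Python sketch of the pairwise merge tree above (the names merge_two and tree_merge are mine, and it assumes the sorted runs fit in memory, which at this scale they would not). It is only meant to show the structure: every level halves the number of merges, so the last level is one merge over all n elements on a single core.

from multiprocessing import Pool

def merge_two(pair):
    """Merge two sorted lists with the usual two-pointer scan."""
    a, b = pair
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:              # one comparison per output element
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def tree_merge(runs):
    """Merge sorted runs pairwise, level by level, as in the diagram above."""
    with Pool() as pool:
        while len(runs) > 1:
            pairs = list(zip(runs[0::2], runs[1::2]))
            merged = pool.map(merge_two, pairs)   # one core per pair
            if len(runs) % 2:                     # an odd run is carried up a level
                merged.append(runs[-1])
            runs = merged
    return runs[0]  # final level: one merge over all n elements on one core

if __name__ == "__main__":
    print(tree_merge([[1, 4], [2, 9], [3, 5], [0, 7], [6, 8]]))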
I then looked into red-black trees. They have several advantages:
- when the tree is built, getting a sorted list out of it is O(n) with no comparisons, which avoids the bottleneck I saw in my merge sort test (see the sketch after this list).
- you can build trees in parallel and merge them in parallel, thus using multiple cores.
- you do not need all the data before you can start building trees (so if you are reading from a slow device, you can sort while you are reading, thus not wasting wall-clock time).
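As a sketch of the first point (illustrative Python, not tied to any particular red-black implementation): once a binary search tree has been built, an in-order walk emits the keys in sorted order without comparing any keys.

class Node:
    """Minimal search-tree node; a real red-black node would also carry a colour bit."""
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def in_order(root):
    """Return the keys in sorted order; note that no key comparison happens here."""
    out, stack, node = [], [], root
    while stack or node:
        while node:              # descend to the leftmost unvisited node
            stack.append(node)
            node = node.left
        node = stack.pop()
        out.append(node.key)     # emit; O(n) total work, zero comparisons
        node = node.right
    return out

# in_order(Node(2, Node(1), Node(3))) == [1, 2, 3]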
Saving the tree to disk also seems pretty easy (simply export the sorted list and the height of the tree), but getting only part of the tree back from disk seems trickier.
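For the disk round-trip, here is a hedged sketch of the reconstruction direction (the rebuild helper is hypothetical): if the tree was saved as its in-order, i.e. sorted, key list, a balanced tree can be rebuilt by bisecting that list; reading back only a subtree, however, would additionally require knowing where each split point sits on disk, which is where it gets tricky.

def rebuild(keys, lo=0, hi=None):
    """Rebuild a balanced search tree from a slice of a sorted key list.

    Nodes are plain (key, left, right) tuples to keep the sketch short.
    """
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return (keys[mid],
            rebuild(keys, lo, mid),
            rebuild(keys, mid + 1, hi))

# rebuild([1, 2, 3]) == (2, (1, None, None), (3, None, None))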
I have read "Which parallel sorting algorithm has the best average case performance?", but it seems to ignore the common case of medium-sized data: data that fits on the server's disk but not in RAM.
Given the hardware (8-128 cores, RAM for 10% of the elements, and disks delivering 100-1000 MBytes/s streaming and 1000 IOPS), what is the fastest way to sort lists of 10^9 to 100 * 10^9 elements of 10-100 bytes each?
In layman's terms:
What is the tried-and-true way to quickly sort the largest amount of data that you would sort on a single server?