
I'm looking to sort lists of 1 billion to 100 billion elements on systems with 8-128 cores, RAM for 10% of the elements, and disks delivering 100-1000 MBytes/s.

I have tested a simple merge sort, where each merge step is performed by its own CPU core, so merges at the same level run in parallel:

sorted_part_a:__
                \__[CPU.1]__
sorted_part_b:__/           \
                             \__[CPU.5]__
sorted_part_c:__             /           \
                \__[CPU.2]__/             \
sorted_part_d:__/                          \
                                            \__[CPU.7]
sorted_part_e:__                            /
                \__[CPU.3]__               /
sorted_part_f:__/           \             /
                             \__[CPU.6]__/
sorted_part_g:__             /
                \__[CPU.4]__/
sorted_part_h:__/

But that has the problem that the final merge step [CPU.7] has to do n comparisons on a single core when merging the last two inputs, and comparisons can be expensive (think strings that have to respect locale settings). In my test, [CPU.7] was the bottleneck.

I have then looked into red-black trees. They have several advantages:

  • when the tree is built, getting a sorted list out of it is O(n) with no comparisons (a sketch of this property follows the list). This avoids the bottleneck I saw in my merge sort test.
  • you can build trees in parallel and merge them in parallel, thus using multiple cores.
  • you do not need all the data before you can start building trees (so if you are reading from a slow device, you can sort while you are reading, thus not wasting wall clock time).
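
A minimal sketch of the first property, using a plain (unbalanced) binary search tree as a stand-in for a red-black tree: once the tree is built, an in-order walk emits the sorted list without doing a single comparison. Only the traversal is the point here; a real implementation would keep the tree balanced.

    # Sketch: building the tree costs comparisons, but once it exists,
    # an in-order walk yields the elements in sorted order with zero
    # comparisons and O(n) total work. A red-black tree would keep the
    # shape balanced; this plain BST only illustrates the traversal.
    class Node:
        __slots__ = ("key", "left", "right")
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Ordinary BST insert: O(log n) comparisons per element on average."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def in_order(root):
        """Yield keys in sorted order; no comparisons are made here."""
        stack, node = [], root
        while stack or node:
            while node:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.key
            node = node.right

    root = None
    for x in [5, 1, 4, 2, 3]:
        root = insert(root, x)
    print(list(in_order(root)))  # [1, 2, 3, 4, 5]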

Saving the tree to disk also seems pretty easy (simply export the sorted list and the height of the tree), but getting only part of the tree back from disk seems trickier.

I have read Which parallel sorting algorithm has the best average case performance? but it seems to ignore the common case of medium-sized data: the data fits on the server's disk, but it does not fit in RAM.

Given the hardware (8-128 cores, RAM for 10% of the elements, and disks delivering 100-1000 MBytes/s streaming, 1000 IOPS), what is the fastest way to sort lists of 10^9 to 100 * 10^9 elements of 10-100 bytes each?

In layman's terms:
What is the tried-and-true way to quickly sort the largest amount of data that you would sort on a single server?

Ole Tange
  • For this sort of scenario, bucket sort works wonders as the first stage – Mooing Duck Apr 16 '20 at 22:06
  • Also, this is similar to a problem I'd been studying for a while, and came to an interesting revelation: Since the disk is _so_ much slower than CPU, your goal should be to make an algorithm that's IO bound. CPU is effectively free. Most likely, the optimal solution can be implemented in a single CPU core. (Then again, my hard drive doesn't do 1GB/s :O ) – Mooing Duck Apr 16 '20 at 22:09
  • https://en.wikipedia.org/wiki/External_sorting – user3386109 Apr 16 '20 at 22:12
  • At least as of Mar 2018, the fastest consumer disk was 480 MB/s. Super Fast SSDs are 235 MB/s. https://www.tomshardware.com/news/seagate-exos-hdd-hamr-mach.2,36719.html – Mooing Duck Apr 16 '20 at 22:12
  • `billion` long or short? – greybeard Apr 16 '20 at 22:15
  • @MooingDuck A RAID-0 array can easily get to 1 GB/s. – btilly Apr 16 '20 at 22:24
  • @greybeard Will your solution differ whether it is long or short? I think the best solution would cover both. – Ole Tange Apr 16 '20 at 23:29
  • @user3386109 External sorting will be a part of it, but disks are so fast today (think NVMe in RAID0) that they can easily overload a single core. But they cannot overload 128 cores. So it is an algorithm that fits in this area I am looking for. Additionally, NVMe's perform better if there are multiple accesses queued in parallel, and by using a parallel algorithm this can be leveraged, too. – Ole Tange Apr 16 '20 at 23:35
  • @MooingDuck 235 MB/s seems rather *slow* for SSDs... did you really mean that? – Kelly Bundy Apr 17 '20 at 01:25
  • @HeapOverflow: I double-checked my source; apparently that was for HDDs. My bad. – Mooing Duck Apr 17 '20 at 02:19
  • @OleTange Yes, my solution is different if there is data for twice RAM capacity, 30 times or *many* times. Same for sequential access secondary storage (tape - LTO & IBM TS are still around, native capacities in the low TB range as of 2020), "non-uniform block access" moving discs, and "uniform block access" NV SS. – greybeard Apr 17 '20 at 03:51
  • I think your objection to that final merge step is unfounded. My experience has been that the final merge step is output bound, even when string comparisons are involved. Should be easy enough to check. Load up a billion strings: half in one array and half in the other. Sort the two arrays. Then merge them to disk. See if you're CPU bound or I/O bound during the merge. – Jim Mischel Apr 17 '20 at 06:07
  • @JimMischel I have tested. It really is the bottleneck if the disks are fast or you can fit a bigger part in memory. It should be easy to convince yourself that it is true. Simply assume comparisons are ridiculously CPU expensive. Then you _really_ want to use all the CPUs you have paid for. – Ole Tange Apr 17 '20 at 07:26
  • @MooingDuck Can you elaborate on bucket sorting input that is designed by the devil (i.e. all your guesses are wrong and the devil will try to design input so they all fall into the same bucket)? – Ole Tange Apr 17 '20 at 07:47
  • @OleTange Where ultimate performance matters, the *revised* Amdahl's Law rules. Would you mind posting the `lscpu` + `hwloc` / `lstopo` output (as in https://stackoverflow.com/a/50221801) for the said System-under-Test? Knowing the reality as-is helps design the best-performing strategy, doesn't it? *(... and indeed MANY THANKS FOR gnu PARALLEL --jobs 1 echo {} ::: cool coool coooool horse-power tool, Sir)* – user3666197 Apr 17 '20 at 10:50
  • The systems I am using today: https://gitlab.com/snippets/1967455 but I would prefer a solution that also works on different-sized servers, which is why I give the hardware specs in the question as a range of pretty ordinary servers. – Ole Tange Apr 17 '20 at 14:19
  • @OleTange: You're right that in pathological cases the bucket sort falls apart. You definitely need a merge sort like Jim suggests as a fallback. Especially when comparisons are very expensive, on average a bucket sort will be _much_ faster for non-pathological cases, saving you log(numBuckets) comparisons per element (see the sketch after these comments). – Mooing Duck Apr 17 '20 at 15:57
  • @OleTange Thank you for the lscpu / lstopo details (you may already know that when posting a comment without the @-signed name, the StackOverflow site will not notify the intended user, so he/she may never find out you have responded to the questions in the comment). Having this hardware, what is your actual performance target baseline for getting those roughly 1E11 items of about 1E3 [B] each sorted? *Having no target, pretty much any path fits & may lead you there.* – user3666197 Apr 17 '20 at 21:18
  • @user3666197 If you focus on my meager results, we will not make enough progress. The target should be: How do we have all k CPUs run O((n/k) log n) comparisons when we are not limited by disk I/O? How do we optimize disk I/O when we are limited by it (typically by doing some, but less, parallel disk I/O)? And can we do this without reading all data before starting? My meager results are at https://gitlab.com/ole.tange/tangetools/-/tree/master/parsort What I am really surprised by here is that there seems to be no well-known best practice. – Ole Tange Apr 17 '20 at 23:35
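
To make Mooing Duck's bucket-sort suggestion from the comments concrete, here is a minimal Python sketch. The sample-based choice of bucket boundaries is my assumption (a common defence against skewed input), not something the comments specify; adversarial input can still defeat it, hence the merge-sort fallback discussed above.

    # Sketch of bucket sort as a first stage: partition once by key
    # range, sort each bucket on its own core, and concatenate. The
    # concatenation needs no further comparisons because the buckets
    # cover disjoint, ordered key ranges.
    import bisect
    import random
    from multiprocessing import Pool

    def make_boundaries(sample, num_buckets):
        """Pick num_buckets-1 splitters from a sorted sample of the data."""
        sample = sorted(sample)
        step = len(sample) // num_buckets
        return [sample[i * step] for i in range(1, num_buckets)]

    def partition(data, boundaries):
        buckets = [[] for _ in range(len(boundaries) + 1)]
        for x in data:
            buckets[bisect.bisect_right(boundaries, x)].append(x)
        return buckets

    if __name__ == "__main__":
        data = [random.random() for _ in range(1_000_000)]
        boundaries = make_boundaries(random.sample(data, 10_000), num_buckets=8)
        buckets = partition(data, boundaries)
        with Pool(8) as pool:
            sorted_buckets = pool.map(sorted, buckets)  # one bucket per core
        result = [x for b in sorted_buckets for x in b]  # no compares here
        assert result == sorted(data)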

2 Answers


I have never had to do this sort of thing when I didn't have custom-built software to do the heavy lifting for me.

But the standard solution when I was at Google was to store your initial data in a distributed file system, do a distributed merge sort, and store the final data in a distributed file system. Since the final sorted data structure is stored in chunks, even in the final pass each CPU only has to do comparisons within its own chunk, allowing full CPU usage the whole way through.
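
A single-machine sketch of that chunking scheme, assuming newline-separated string keys; the file names, the fixed splitters, and input.txt are hypothetical stand-ins (real code would pick the splitters from a sample of the data, and as noted in the comments below, the chunks are chosen dynamically):

    # Sketch: stream the input once into disjoint key-range files, sort
    # each file on its own core, and the chunk files read in range order
    # already form the globally sorted output -- no final merge needed.
    import bisect
    from multiprocessing import Pool

    BOUNDARIES = ["d", "h", "l", "p", "t"]  # 6 ranges; sample-based in practice

    def partition_to_files(input_path):
        files = [open(f"chunk_{i}.txt", "w") for i in range(len(BOUNDARIES) + 1)]
        with open(input_path) as f:
            for line in f:
                files[bisect.bisect_right(BOUNDARIES, line)].write(line)
        for f in files:
            f.close()
        return [f"chunk_{i}.txt" for i in range(len(BOUNDARIES) + 1)]

    def sort_file(path):
        """Sort one chunk in RAM; an external sort would go here if a
        chunk exceeds memory."""
        with open(path) as f:
            lines = sorted(f)
        with open(path, "w") as f:
            f.writelines(lines)

    if __name__ == "__main__":
        chunks = partition_to_files("input.txt")  # hypothetical input file
        with Pool(len(chunks)) as pool:
            pool.map(sort_file, chunks)           # one chunk per core
        # Reading chunk_0.txt, chunk_1.txt, ... in order is the sorted output.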

For large data sets there is essentially never a use case where you need it all in a single place at a single time and have to iterate over the whole thing. On the contrary, imposing that arbitrary limitation just creates an unnecessary bottleneck.

btilly
  • How do you merge chunks to get a single sorted output? Are they not going through a single core that has to run a compare for each element? Or do you have a parallel algorithm for the final step? – Ole Tange Apr 16 '20 at 23:39
  • Assume that we are 1 byte short of using multiple servers and 1 byte short of using a distributed file system. So we are talking the max that you would do on a single server. – Ole Tange Apr 17 '20 at 00:10
  • @OleTange The chunks look like "A-G", "H-M", ... and are dynamically chosen once you have done enough merging that you have long enough files to want to start chunking things. They do not have to be exactly the same size. And because the ranges are disjoint, they are naturally parallelizable. – btilly Apr 17 '20 at 03:57
  • @OleTange This does require a mindset change. But once the mindset is changed, it is easy. – btilly Apr 17 '20 at 03:57
  • @btilly: I think this answer is _correct_, but it's very difficult to understand. – Mooing Duck Apr 17 '20 at 05:02
  • @MooingDuck Would it help to say that the "custom built software to do the heavy lifting" that I was referring to is described in https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf? – btilly Apr 17 '20 at 16:11
  • @btilly There is a good question in what is the optimal way to sort data on large clusters, and your link to Google's MapReduce article makes sense for that question. The question here, though, is: What is the fastest way of sorting the biggest amount of data that *you would sort on a single server*? – Ole Tange Apr 17 '20 at 23:40
  • @OleTange Parallelization is parallelization. The strategy that I described will let you use all CPUs the whole way through. – btilly Apr 18 '20 at 01:08

In the traditional merge, using sorted subfiles, that final merge is O(n log k), where n is the total number of items and k is the number of subfiles. Basically, you build a priority queue of the first item from each of the sorted subfiles, remove the smallest item, write it out, and then insert the next item from the file that the smallest item came from.
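
As a minimal Python sketch of that priority-queue merge (the standard library ships the same thing as heapq.merge; this version spells out the heap bookkeeping):

    # k-way merge with a heap: the heap holds the current head of each
    # sorted run, so every output element costs O(log k) comparisons.
    import heapq

    def kway_merge(sorted_runs):
        heap = [(run[0], i, 0) for i, run in enumerate(sorted_runs) if run]
        heapq.heapify(heap)
        while heap:
            item, i, j = heapq.heappop(heap)  # smallest current head
            yield item
            if j + 1 < len(sorted_runs[i]):   # refill from the same run
                heapq.heappush(heap, (sorted_runs[i][j + 1], i, j + 1))

    runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
    print(list(kway_merge(runs)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]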

But you can parallelize that merge. Say you have eight subfiles. You can build a merge network like this:

    f1    f2    f3    f4    f5    f6    f7    f8
      \  /        \  /        \  /        \  /
       p1          p2          p3          p4
         \__    __/              \__    __/
            \  /                    \  /
             p5                      p6
                \_______    _______/
                        \  /
                         p7

The idea here is that each processor core p1 through p4 begins merging two files. Processors p5 and p6 each merge the output from two of the first-level processors, and p7 merges the results from them. p7 ends up doing n comparisons rather than the O(n log k) comparisons it would have done if a single CPU core were doing the whole merge.
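
A minimal sketch of that merge network, using Python threads and bounded queues as stand-ins for cores and the producer/consumer buffers described in the comments below; the run contents and queue sizes are arbitrary:

    # Each merge node pulls from two input queues and pushes the smaller
    # head to its output queue; only the root node performs the final
    # n comparisons, exactly as in the diagram above.
    import threading
    from queue import Queue

    DONE = object()  # sentinel marking the end of a stream

    def merge_node(q_left, q_right, q_out):
        a, b = q_left.get(), q_right.get()
        while a is not DONE and b is not DONE:
            if a <= b:
                q_out.put(a)
                a = q_left.get()
            else:
                q_out.put(b)
                b = q_right.get()
        rest, q_rest = (b, q_right) if a is DONE else (a, q_left)
        while rest is not DONE:  # drain the stream that is not finished
            q_out.put(rest)
            rest = q_rest.get()
        q_out.put(DONE)

    def feed(items, q):
        for x in items:
            q.put(x)
        q.put(DONE)

    runs = [[1, 9], [2, 10], [3, 11], [4, 12]]  # four sorted subfiles
    q_in = [Queue(maxsize=1024) for _ in runs]
    q_mid = [Queue(maxsize=1024) for _ in range(2)]
    q_out = Queue(maxsize=1024)
    workers = [threading.Thread(target=feed, args=(r, q))
               for r, q in zip(runs, q_in)]
    workers += [threading.Thread(target=merge_node, args=(q_in[0], q_in[1], q_mid[0])),
                threading.Thread(target=merge_node, args=(q_in[2], q_in[3], q_mid[1])),
                threading.Thread(target=merge_node, args=(q_mid[0], q_mid[1], q_out))]
    for w in workers:
        w.start()
    result = []
    while (x := q_out.get()) is not DONE:
        result.append(x)
    print(result)  # [1, 2, 3, 4, 9, 10, 11, 12]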

Jim Mischel
  • " that has the problem that the final merge step has to do n comparisons on a single core when merging the last two [subfiles]" – Mooing Duck Apr 17 '20 at 04:59
  • This also writes NlogN bytes to disk. Very inefficient. – Mooing Duck Apr 17 '20 at 04:59
  • @MooingDuck Yes, the final merge has to do n comparisons. Which is a whole lot fewer than n log k. It'll probably be output bound, anyway. And, no, it does not write n log n bytes to disk. The merge network is running in memory. The first level of processors is streaming from the files and merging into memory. The next levels read the memory buffers filled by the previous level. It's all done with producer/consumer queues. – Jim Mischel Apr 17 '20 at 05:03
  • Apologies, I didn't read the answer thoroughly and misunderstood the merge network. – Mooing Duck Apr 17 '20 at 05:22
  • @JimMischel Your solution is the simple merge sort that I reject, precisely because a single core has to do n comparisons. If the disks are fast or if most of the data fits in RAM then this _is_ the bottleneck - especially if comparisons are CPU expensive, as suggested in the question (I have tested). Can we find a solution where one core does not have to do n comparisons? Something where k processors each do O((n/k) log n) comparisons? – Ole Tange Apr 17 '20 at 07:20