143

Sorting takes O(n log n) in the serial case. If we have O(n) processors we would hope for a linear speedup. O(log n) parallel algorithms exist but they have a very high constant. They also aren't applicable on commodity hardware which doesn't have anywhere near O(n) processors. With p processors, reasonable algorithms should take O(n/p log n) time.

In the serial case, quick sort has the best runtime complexity on average. A parallel quick sort algorithm is easy to implement (see here and here). However it doesn't perform well since the very first step is to partition the whole collection on a single core. I have found information on many parallel sort algorithms but so far I have not seen anything pointing to a clear winner.

I'm looking to sort lists of 1 million to 100 million elements in a JVM language running on 8 to 32 cores.

Community
  • 1
  • 1
Craig P. Motlin
  • 26,452
  • 17
  • 99
  • 126
  • 1
    I think you have one too many n/p in your "should take" – Sparr Sep 17 '12 at 14:16
  • @Sparr I don't think so. I'm making a distinction between having a few processors and having as many processors as elements being sorted. – Craig P. Motlin Sep 17 '12 at 15:07
  • @CraigP.Motlin right, but you seem to have "distributed" the /p erroneously. There should be only one /p. – Sparr Sep 18 '12 at 16:23
  • 1
    @CraigP.Motlin I think you kept the wrong one :) – Sparr Sep 20 '12 at 19:56
  • When sorting very large lists I would suggest that best worst case performance would be a better option. If avergae is O(n log n) and worst case is O(n^2) then you may have a problem! – redcalx Sep 22 '12 at 09:07
  • @CraigP.Motlin, the question about the kind of elements is important to determine what kind of operations can be performed on them. Because if the items are integer numbers, then radix sort can be applied which is `O(n)` for a sequential algorithm, and the constant is small when the bit length of the numbers is small (e.g. 16-bit numbers). Something similar applies for strings. An inapplicable case is when elements e.g. can only be pairwise-compared. There is also vectorization opportunities like AVX to consider: though you may not be able to access it from Java. – Serge Rogatch Aug 19 '17 at 19:38

4 Answers4

218

The following article (PDF download) is a comparative study of parallel sorting algorithms on various architectures:

Parallel sorting algorithms on various architectures

According to the article, sample sort seems to be best on many parallel architecture types.

Update to address Mark's concern of age:

Here are more recent articles introducing something more novel (from 2007, which, btw, still get compared with sample sort):

Improvements on sample sort
AA-Sort

The bleeding edge (circa 2010, some only a couple months old):

Parallel sorting pattern
Many-core GPU based parallel sorting
Hybrid CPU/GPU parallel sort
Randomized Parallel Sorting Algorithm with an Experimental Study
Highly scalable parallel sorting
Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach

Update for 2013: Here is the bleeding edge circa January, 2013. (Note: A few of the links are to papers at Citeseer and require registration which is free):

University lectures:
Parallel Partitioning for Selection and Sorting
Parallel Sorting Algorithms Lecture
Parallel Sorting Algorithms Lecture 2
Parallel Sorting Algorithms Lecture 3

Other sources and papers:
A novel sorting algorithm for many-core architectures based on adaptive bitonic sort
Highly Scalable Parallel Sorting 2
Parallel Merging
Parallel Merging 2
Parallel Self-Sorting System for Objects
Performance Comparison of Sequential Quick Sort and Parallel Quick Sort Algorithms
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
Various parallel algorithms (sorting et al) including implementations

GPU and CPU/GPU hybrid sources and papers:
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
Data Sorting Using Graphics Processing Units
Efficient Algorithms for Sorting on GPUs
Designing efficient sorting algorithms for manycore GPUs
Deterministic Sample Sort For GPUs
Fast in-place sorting with CUDA based on bitonic sort
Fast parallel GPU-sorting using a hybrid algorithm
Fast Parallel Sorting Algorithms on GPUs
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
GPU sample sort
GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures
GPUTeraSort: high performance graphics co-processor sorting for large database management
High performance comparison-based sorting algorithm on many-core GPUs
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting on GPUs for large scale datasets: A thorough comparison

Update for 2022: I have not forgotten this answer and like all things computer related, it has not aged well. I will do my best to update and refresh it for current trends as well as the state of the art, at some point prior to the end of this year (2022). If you have interest in this topic and would like to see the update sooner, please either reply to or better yet, upvote the comment I made below this answer, so that I can gauge interest in this topic over others that also are in need of an update.

Michael Goldshteyn
  • 71,784
  • 24
  • 131
  • 181
  • 2
    It's a comparative study of parallel sorting algorithms on various architectures current in 1996. A lot has changed in parallel computing since then. – High Performance Mark Oct 19 '10 at 15:15
  • 1
    It seems you missed what's IMHO the best of all, Efficient Implementation of Sorting in Multi-core SIMD architecture. From Intel research, presented at VLDB 2008. – alecco Oct 11 '14 at 18:27
  • 4
    This would have been a great answer, once. Now, most of the links are broken. – Tim Long Apr 14 '19 at 00:46
  • @MichaelGoldshteyn would greatly appreciate any update. Thanks! – geenweb Apr 12 '22 at 17:19
  • 13
    @geenweb , I do have an update planned for this year (no definite date, but I have not forgotten about this answer). If you are interested in an update to this list for 2022, please upvote this comment, so that I can gauge the level of interest on SO and prioritize this item over others I also maintain. – Michael Goldshteyn Jun 16 '22 at 15:07
  • many of the links in your answer are not working in 2022, can you please update them or mention the full title instead? – user2348209 Oct 03 '22 at 15:33
  • That is a pure gold answer. well done sir :) – StudentAccount4 Feb 11 '23 at 07:53
  • If an update is given, I would like to see a synthesis (or a link to a survey article) that would give a better explanation of why some sorting algorithm fares better and under which assumptions. It is not clear whether these linked papers assumptions fall into the OP's parallelism conditions (8-32 cores). – Dimitri Lesnoff May 16 '23 at 09:29
7

I have worked with both a Parallel Quicksort algorithm and a PSRS algorithm that essentially combines quicksort in parallel with merging.

With the Parallel Quicksort algorithm, I have demonstrated near linear speedup with up to 4 cores (dual core with hyper-threading), which is expected given the limitations of the algorithm. A pure Parallel Quicksort relies on a shared stack resource which will result in contention among threads, thus reducing any gain in performance. The advantage of this algorithm is that it sorts 'in-place,' which reduces the amount of memory needed. You may want to consider this when sorting upwards of 100M elements as you stated.

I see you are looking to sort on a system with 8-32 cores. The PSRS algorithm avoids contention at the shared resource, allowing speedup at higher numbers of processes. I have demonstrated the algorithm with up to 4 cores as above, but experimental results of others report near linear speedup with much larger numbers of core, 32 and beyond. The disadvantage of the PSRS algorithm is that it is not in-place and will require considerably more memory.

If you are interested, you may use or peruse my Java code for each of these algorithms. You can find it on github: https://github.com/broadbear/sort. The code is intended as a drop-in replacement of Java Collections.sort(). If you are looking for the ability to perform parallel sorting in a JVM as you state above, the code in my repo may help you out. The API is fully genericized for elements implementing Comparable or implementing your own Comparator.

May I ask what you are looking to sort that many elements for? I'm interested to know of potential applications for my sorting package.

broadbear
  • 614
  • 8
  • 19
  • I got an 8 core processor. :) Now I have tested sorting upwards of 40M elements. I am not seeing linear speedup, but I am seeing substantial performance gain over the standard Java 8 Collections sort algorithm, which is supposedly a multi-threadd Timsort. My PSRS implementation sorts 40M elements in an average of 4985 ms, compared with 19759 ms for the default JDK sort algorithm. – broadbear Aug 05 '16 at 05:29
4

Take a look at this paper: A Scalable Parallel Sorting Algorithm Using Exact Splitting. It is concerned with many more than 32 cores. However, it describes in detail an algorithm, which has a running time complexity of O(n/p * log(n) + p * log(n)**2) and is applicable for arbitrary comparators.

haraldkl
  • 3,809
  • 26
  • 44
2

The paper "Comparison of Parallel Sorting Algorithms on Different Architectures" may be a good place for you to start.

Soleil
  • 6,404
  • 5
  • 41
  • 61
LBushkin
  • 129,300
  • 32
  • 216
  • 265