31

Java 8 provides java.util.Arrays.parallelSort, which sorts arrays in parallel using the fork-join framework. But there's no corresponding Collections.parallelSort for sorting lists.

I can use toArray, sort that array, and store the result back in my list, but that will temporarily increase memory usage, which if I'm using parallel sorting is already high because parallel sorting only pays off for huge lists. Instead of twice the memory (the list plus parallelSort's working memory), I'm using thrice (the list, the temporary array and parallelSort's working memory). (Arrays.parallelSort documentation says "The algorithm requires a working space no greater than the size of the original array".)

Memory usage aside, Collections.parallelSort would also be more convenient for what seems like a reasonably common operation. (I tend not to use arrays directly, so I'd certainly use it more often than Arrays.parallelSort.)

The library can test for RandomAccess to avoid trying to e.g. quicksort a linked list, so that can't a reason for a deliberate omission.

How can I sort a List in parallel without creating a temporary array?

Jeffrey Bosboom
  • 13,313
  • 16
  • 79
  • 92
  • 3
    All Java's sorting algorithms for `List` use stable sorting algorithms derived from mergesort, which do linear amounts of temporary allocation anyway. – Louis Wasserman Sep 21 '14 at 16:19
  • @LouisWasserman: Okay, but I'm still using thrice rather than twice the memory: the list, the toArray result and Arrays.parallelSort's working space (note that Arrays.parallelSort uses "a working space no greater than the size of the original array" -- linear temporary memory). When the lists are large (as required for parallel sorting to be a win), these constant factors begin to matter. (Also, Collections.parallelSort would be more convenient than using a temp array, too.) – Jeffrey Bosboom Sep 21 '14 at 16:25
  • 1
    An aside: The first line of `Collections.sort(List)` is `Object[] a = list.toArray()`: Every list is converted into an array, then the array is sorted, and the sorted array is written back to the list. It would not be so difficult to create a sort implementation that avoids this step, but details on that can be found in the answers that have been given so far. – Marco13 Sep 22 '14 at 15:03
  • 4
    Filed request for enhancement: [JDK-8059093](https://bugs.openjdk.java.net/browse/JDK-8059093). – Stuart Marks Sep 25 '14 at 00:37
  • 1
    @StuartMarks: You can add to that bug report that I'd be satisfied with adding ArrayList.parallelSort. (I don't have and can't get an account there.) – Jeffrey Bosboom Sep 25 '14 at 14:45

5 Answers5

23

There doesn't appear to be any straightforward way to sort a List in parallel in Java 8. I don't think this is fundamentally difficult; it looks more like an oversight to me.

The difficulty with a hypothetical Collections.parallelSort(list, cmp) is that the Collections implementation knows nothing about the list's implementation or its internal organization. This can be seen by examining the Java 7 implementation of Collections.sort(list, cmp). As you observed, it has to copy the list elements out to an array, sort them, and then copy them back into the list.

This is the big advantage of the List.sort(cmp) extension method over Collections.sort(list, cmp). It might seem that this is merely a small syntactic advantage being able to write myList.sort(cmp) instead of Collections.sort(myList, cmp). The difference is that myList.sort(cmp), being an interface extension method, can be overridden by the specific List implementation. For example, ArrayList.sort(cmp) sorts the list in-place using Arrays.sort() whereas the default implementation implements the old copyout-sort-copyback technique.

It should be possible to add a parallelSort extension method to the List interface that has similar semantics to List.sort but does the sorting in parallel. This would allow ArrayList to do a straightforward in-place sort using Arrays.parallelSort. (It's not entirely clear to me what the default implementation should do. It might still be worth it to do copyout-parallelSort-copyback.) Since this would be an API change, it can't happen until the next major release of Java SE.

As for a Java 8 solution, there are a couple workarounds, none very pretty (as is typical of workarounds). You could create your own array-based List implementation and override sort() to sort in parallel. Or you could subclass ArrayList, override sort(), grab the elementData array via reflection and call parallelSort() on it. Of course you could just write your own List implementation and provide a parallelSort() method, but the advantage of overriding List.sort() is that this works on the plain List interface and you don't have to modify all the code in your code base to use a different List subclass.

Stuart Marks
  • 127,867
  • 37
  • 205
  • 259
5

I think you are doomed to use a custom List implementation augmented with your own parallelSort or else change all your other code to store the big data in Array types.

This is the inherent problem with layers of abstract data types. They're meant to isolate the programmer from details of implementation. But when the details of implementation matter - as in the case of underlying storage model for sort - the otherwise splendid isolation leaves the programmer helpless.

The standard List sort documents provide an example. After the explanation that mergesort is used, they say

The default implementation obtains an array containing all elements in this list, sorts the array, and iterates over this list resetting each element from the corresponding position in the array. (This avoids the n2 log(n) performance that would result from attempting to sort a linked list in place.)

In other words, "since we don't know the underlying storage model for a List and couldn't touch it if we did, we make a copy organized in a known way." The parenthesized expression is based on the fact that the List "i'th element accessor" on a linked list is Omega(n), so the normal array mergesort implemented with it would be a disaster. In fact it's easy to implement mergesort efficiently on linked lists. The List implementer is just prevented from doing it.

A parallel sort on List has the same problem. The standard sequential sort fixes it with custom sorts in the concrete List implementations. The Java folks just haven't chosen to go there yet. Maybe in Java 9.

Gene
  • 46,253
  • 4
  • 58
  • 96
3

Use the following:

yourCollection.parallelStream().sorted().collect(Collectors.toList());

This will be parallel when sorting, because of parallelStream(). I believe this is what you mean by parallel sort?

bcsb1001
  • 2,834
  • 3
  • 24
  • 35
  • 3
    Under the covers (in java.util.stream.SortedOps.OfRef.opEvaluateParallel), this creates a temporary array and calls Arrays.parallelSort; in addition it's collecting into a new list, not the original list. Thus while it's concise, it's even worse in terms of memory usage (due to the new list) than just sorting a temporary array. – Jeffrey Bosboom Sep 21 '14 at 17:48
0

Just speculating here, but I see several good reasons for generic sort algorithms preferring to work on arrays instead of List instances:

  • Element access is performed via method calls. Despite all the optimizations JIT can apply, even for a list that implements RandomAccess, this probably means a lot of overhead compared to plain array accesses which can be optimized very well.
  • Many algorithms require copying some fragments of the array to temporary structures. There are efficient methods for copying arrays or their fragments. An arbitrary List instance on the other hand, can't be easily copied. New lists would have to be allocated which poses two problems. First, this means allocating some new objects which is likely more costly than allocating arrays. Second, the algorithm would have to choose what implementation of List should be allocated for this temporary structure. There are two obvious solutions, both bad: either just choose some hard-coded implementation, e.g. ArrayList, but then it could just allocate simple arrays as well (and if we're generating arrays then it's much easier if the soiurce is also an array). Or, let the user provide some list factory object, which makes the code much more complicated.
  • Related to the previous issue: there is no obvious way of copying a list into another due to how the API is designed. The best the List interface offers is addAll() method, but this is probably not efficient for most cases (think of pre-allocating the new list to its target size vs adding elements one by one which many implementations do).
  • Most lists that need to be sorted will be small enough for another copy to not be an issue.

So probably the designers thought of CPU efficiency and code simplicity most of all, and this is easily achieved when the API accepts arrays. Some languages, e.g. Scala, have sort methods that work directly on lists, but this comes at a cost and probably is less efficient than sorting arrays in many cases (or sometimes there will probably just be a conversion to and from array performed behind the scenes).

Michał Kosmulski
  • 9,855
  • 1
  • 32
  • 51
  • 1
    Re point 2/3: if the algorithm needs temporary space, it can just allocate arrays. There'd be no advantage to allocating a list, as the result will be stored in the original list. If it needs an internal copy of part of the list, it can use `subList(i, j).toArray()`. Storing the result back into the list (if not sorted in-place) does need to set each element one-by-one though, I'll give you that. – Jeffrey Bosboom Sep 21 '14 at 17:13
  • 1
    Re point 4: lists large enough for parallel sorting to be a win are large enough that another copy is an issue. Still, this answer seems to boil down to "lists are less performant and parallelSort is about performance", which is a reasonable answer. – Jeffrey Bosboom Sep 21 '14 at 17:15
0

By combining the existing answers I came up with this code.
This works if you are not interested in creating a custom List class and if you don't bother to create a temporary array (Collections.sort is doing it anyway).
This uses the initial list and does not create a new one as in the parallelStream solution.

// Convert List to Array so we can use Arrays.parallelSort rather than Collections.sort.
// Note that Collections.sort begins with this very same conversion, so we're not adding overhead
// in comparaison with Collections.sort.
Foo[] fooArr = fooLst.toArray(new Foo[0]);

// Multithread the TimSort. Automatically fallback to mono-thread when size is less than 8192.
Arrays.parallelSort(fooArr, Comparator.comparingStuff(Foo::yourmethod));

// Refill the List using the sorted Array, the same way Collections.sort does it.
ListIterator<Foo> i = fooLst.listIterator();
for (Foo e : fooArr) {
    i.next();
    i.set(e);
}
rdupz
  • 2,204
  • 2
  • 13
  • 21