sorting Lists after groupingBy

Question

I am wondering, if there is already an implemented feature in streams (or Collectors) which has sorted Lists as values. E.g. the following codes both produce gender-grouped Lists of persons, sorted by age. The first solution has some overhead sorting (and looks a little bit scruffy). The second solution needs to look at every person twice but does the job in a pretty way.

First sorting then grouping in one stream:

Map<Gender, List<Person>> sortedListsByGender = (List<Person>) roster
        .stream()
        .sorted(Person::compareByAge)
        .collect(Collectors.groupingBy(Person::getGender));

First grouping, then sorting every value:

Map<Gender, List<Person>> sortedListsByGender = (List<Person>) roster
        .stream()
        .collect(Collectors.groupingBy(Person::getGender));
sortedListsByGender.values()
        .forEach(list -> Collections.sort(list, Person::compareByAge));

I am just wondering, if there is already something implemented, which does this in one run, like groupingBySorted.

Nope, there isn't. (That said -- why do you think it wouldn't work in parallel streams? Sure it would.) — Louis Wasserman, Mar 08 '16 at 16:11
@LouisWasserman Are you sure about that (the second part of your comment)? Looking at the docs of `sorted` and `collect` I have a strong hunch this will break with parallel streaming, but I can be completely wrong here. Feel free, to correct me in my question, wouldn't want to spread wrong information. — ctst, Mar 08 '16 at 16:17
Yes, yes I am. Parallelism doesn't interfere with the ordering property of a stream. — Louis Wasserman, Mar 08 '16 at 16:17
Probably because there are no built-in `SortedList`: http://stackoverflow.com/q/8725387/1743880 — Tunaki, Mar 08 '16 at 16:35
You can slightly compact the second code snippet using `Collectors.collectingAndThen(groupingBy(Person::getGender), l -> {Collections.sort(l, Person::compareByAge); return l;});`, which may be worth it if you store and reuse the resulting Collector in many places. But if you want to sort while collecting, you'll need to write it yourself. This kind of makes sense, because it's probably faster to build an unsorted list and quick/merge sort at the end, rather than maintaining a sorted list during collection by inserting each element in its place (insertion sort). — Jeffrey Bosboom, Mar 08 '16 at 22:06
@JeffreyBosboom, note that [by specification](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-), `groupingBy` collector does not guarantee that Lists created inside map are mutable. Thus theoretically your code might break in future Java versions. It would be more correct to group via `Collectors.groupingBy(Person::getGender, Collectors.toCollection(ArrayList::new))`. — Tagir Valeev, Mar 09 '16 at 04:03
@TagirValeev Good catch. Switching collectors to a linked-list-of-segments implementation would allow for better merging, but if the developers are serious about doing that, they should have returned an immutable list in the first place, because lots of code will accidentally depend on the mutability. Given Java's preference for backwards compatibility, it will be interesting to see if they're willing to break things. (Really, the problem is that mutability should be part of the type, but that ship sailed in Java 1.3.) — Jeffrey Bosboom, Mar 09 '16 at 04:12

Holger · Accepted Answer · 2016-03-10T10:26:00.563

When using sorted(comparator) on the stream before the collect operation, the stream has to buffer the entire stream contents to be able to sort it and the sorting may involve much more data movement within that buffer, compared to sorting the smaller lists of the groups afterwards. So the performance is not as good as sorting the individual groups though the implementation will utilize multiple cores if parallel processing has been enabled.

But note that using sortedListsByGender.values().forEach(…) is not a parallelizable operation and even using sortedListsByGender.values().parallelStream().forEach(…) would only allow parallel processing of groups while each sort operation still is sequential.

When performing the sort operation within a collector as in

static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
    return Collectors.collectingAndThen(
        Collectors.toCollection(ArrayList::new), l->{ l.sort(c); return l; } );
}

Map<Gender, List<Person>> sortedListsByGender = roster.stream()
    .collect(Collectors.groupingBy(Person::getGender, toSortedList(Person::compareByAge)));

the sort operation behaves the same (thanks to Tagir Valeev for correcting me), but you can easily check how a sort-on-insertion strategy performs. Just change the collector implementation to:

static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
    return Collectors.collectingAndThen(
        Collectors.toCollection(()->new TreeSet<>(c)), ArrayList::new);
}

For completeness, if you want a collector which inserts sorted into an ArrayList in the first place to avoid the final copy step, you can use a more elaborated collector like this:

static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
    return Collector.of(ArrayList::new,
        (l,t) -> {
            int ix=Collections.binarySearch(l, t, c);
            l.add(ix<0? ~ix: ix, t);
        },
        (list1,list2) -> {
            final int s1=list1.size();
            if(list1.isEmpty()) return list2;
            if(!list2.isEmpty()) {
                list1.addAll(list2);
                if(c.compare(list1.get(s1-1), list2.get(0))>0)
                    list1.sort(c);
            }
            return list1;
        });
}

It’s efficient for the sequential usage but its merge function is not optimal. The underlying sort algorithm will benefit from presorted ranges but has to find these ranges first despite our merge function actually knows these ranges. Unfortunately, there’s no public API in the JRE allowing us to utilize these information (efficiently; we can pass subLists to binarySearch but creating a new sub list for each element of list2 may turn out to be too expensive). If we want to raise the performance of the parallel execution further, we have to re-implement the merge part of the sorting algorithm:

static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
    return Collector.of(ArrayList::new,
        (l,t) -> l.add(insertPos(l, 0, l.size(), t, c), t),
        (list1,list2) -> merge(list1, list2, c));
}
static <T> List<T> merge(List<T> list1, List<T> list2, Comparator<? super T> c) {
    if(list1.isEmpty()) return list2;
    for(int ix1=0, ix2=0, num1=list1.size(), num2=list2.size(); ix2<num2; ix2++, num1++) {
        final T element = list2.get(ix2);
        ix1=insertPos(list1, ix1, num1, element, c);
        list1.add(ix1, element);
        if(ix1==num1) {
            while(++ix2<num2) list1.add(list2.get(ix2));
            return list1;
        }
    }
    return list1;
}
static <T> int insertPos(
    List<? extends T> list, int low, int high, T t, Comparator<? super T> c) {
    high--;
    while(low <= high) {
        int mid = (low+high)>>>1, cmp = c.compare(list.get(mid), t);
        if(cmp < 0) low = mid + 1;
        else if(cmp > 0) high = mid - 1;
        else {
            mid++;
            while(mid<=high && c.compare(list.get(mid), t)==0) mid++;
            return mid;
        }
    }
    return low;
}

Note that this last solution, unlike the simple binarySearch based insertion, is a stable sort implementation, i.e. in your case, Persons with the same age and Gender won’t change their relative order, if the source stream has a defined encounter order.

Erm... AFAIK, downstream finisher is [executed](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/stream/Collectors.java#921) sequentially for all map entries during the upstream finisher execution and upstream finisher is always [executed](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/stream/ReferencePipeline.java#503) only once in a caller thread. Finisher is never executed on intermediate containers. That would not work as finisher may change the object type. — Tagir Valeev, Mar 10 '16 at 01:09
@Tagir Valeev: you’re right, I was focused too much on the merge function. — Holger, Mar 10 '16 at 10:33
Thanks for your elaborated answer! I am still checking the performances (which vary heavily depending on amount of persons (and gender :-) ) and if sequential or parallel streams are used). Hopefully some of this will get implemented in future versions. — ctst, Mar 10 '16 at 12:30

sorting Lists after groupingBy

1 Answers1

Linked

Related