Grouping by object value, counting and then setting group key by maximum object attribute

Question

I have managed to write a solution using Java 8 Streams API that first groups a list of object Route by its value and then counts the number of objects in each group. It returns a mapping Route -> Long. Here is the code:

Map<Route, Long> routesCounted = routes.stream()
                .collect(Collectors.groupingBy(gr -> gr, Collectors.counting()));

And the Route class:

public class Route implements Comparable<Route> {
    private long lastUpdated;
    private Cell startCell;
    private Cell endCell;
    private int dropOffSize;

    public Route(Cell startCell, Cell endCell, long lastUpdated) {
        this.startCell = startCell;
        this.endCell = endCell;
        this.lastUpdated = lastUpdated;
    }

    public long getLastUpdated() {
        return this.lastUpdated;
    }

    public void setLastUpdated(long lastUpdated) {
        this.lastUpdated = lastUpdated;
    }

    public Cell getStartCell() {
        return startCell;
    }

    public void setStartCell(Cell startCell) {
        this.startCell = startCell;
    }

    public Cell getEndCell() {
        return endCell;
    }

    public void setEndCell(Cell endCell) {
        this.endCell = endCell;
    }

    public int getDropOffSize() {
        return this.dropOffSize;
    }

    public void setDropOffSize(int dropOffSize) {
        this.dropOffSize = dropOffSize;
    }

    @Override
    /**
     * Compute hash code by using Apache Commons Lang HashCodeBuilder.
     */
    public int hashCode() {
        return new HashCodeBuilder(43, 59)
                .append(this.startCell)
                .append(this.endCell)
                .toHashCode();
    }

    @Override
    /**
     * Compute equals by using Apache Commons Lang EqualsBuilder.
     */
    public boolean equals(Object obj) {
        if (!(obj instanceof Route))
            return false;
        if (obj == this)
            return true;

        Route route = (Route) obj;
        return new EqualsBuilder()
                .append(this.startCell, route.startCell)
                .append(this.endCell, route.endCell)
                .isEquals();
    }

    @Override
    public int compareTo(Route route) {
        if (this.dropOffSize < route.dropOffSize)
            return -1;
        else if (this.dropOffSize > route.dropOffSize)
            return 1;
        else {
                // if contains drop off timestamps, order by last timestamp in drop off
                // the highest timestamp has preceding
            if (this.lastUpdated < route.lastUpdated)
                return -1;
            else if (this.lastUpdated > route.lastUpdated)
                return 1;
            else
                return 0;
        }
    }
}

What I would like to additionally achieve is that the key for each group would be the one with the largest lastUpdated value. I was already looking at this solution but I do not know how to combine the counting and grouping by value and Route maximum lastUpdated value. Here is the example data of what I want to achieve:

EXAMPLE:

List<Route> routes = new ArrayList<>();
routes.add(new Route(new Cell(1, 2), new Cell(2, 1), 1200L));
routes.add(new Route(new Cell(3, 2), new Cell(2, 5), 1800L));
routes.add(new Route(new Cell(1, 2), new Cell(2, 1), 1700L));

SHOULD BE CONVERTED TO:

Map<Route, Long> routesCounted = new HashMap<>();
routesCounted.put(new Route(new Cell(1, 2), new Cell(2, 1), 1700L), 2);
routesCounted.put(new Route(new Cell(3, 2), new Cell(2, 5), 1800L), 1);

Notice that the key for mapping, which counted 2 Routes is the one with the largest lastUpdated value.

In example you are using `new Route` with 3 parameters, while the only constructor has 4 parameters. Could you please correct it? — Tagir Valeev, May 13 '15 at 09:24
Ups my bad. It is fixed now. Basically the dropOffSize size does not matter here, but I did leave it in the code, because I wanted to show all the overriden methods and compareTo method does make use of dropOffSize. — Jernej Jerin, May 13 '15 at 09:35

Misha · Accepted Answer · 2015-05-13T10:54:02.583

8

Here's one approach. First group into lists and then process the lists into the values you actually want:

import static java.util.Comparator.comparingLong;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toMap;


Map<Route,Integer> routeCounts = routes.stream()
        .collect(groupingBy(x -> x))
        .values().stream()
        .collect(toMap(
            lst -> lst.stream().max(comparingLong(Route::getLastUpdated)).get(),
            List::size
        ));

edited May 13 '15 at 10:54

answered May 13 '15 at 09:37

Misha

27,433
6
62
78

I reckon this solution is performance heavy because it creates intermediate lists. Would be possible to get around this? Also I am getting the following two complaints from compiler: "Cannot resolve method 'stream()'" and "Cannot resolve method 'size()'" – Jernej Jerin May 13 '15 at 09:51
1

Just how huge is your list of routes? You could certainly do this in a more performant way, but at the cost of simplicity. As for compiler errors, are you using eclipse? I just tried it with jdk1.8.0_25 and it compiles fine. – Misha May 13 '15 at 09:54
I am using IntelliJ IDEA 14.1. The list of Routes is resizing with respect to time. It actually is not a list but a ArrayDeque as I am simulating moving window. The size of it should be around 1e4 to 2e4 I reckon. – Jernej Jerin May 13 '15 at 09:57
1

Maybe you don't have the static imports? try `Collectors.groupingBy` `Comparator.comparing` and `Collectors.toMap` instead of just the method names. – Misha May 13 '15 at 10:03
I already did that. I guess that lambda parameter lst is not of type List. – Jernej Jerin May 13 '15 at 10:10
1

`lst` is indeed of type `List`. I just fired up IntelliJ and it's not complaining. – Misha May 13 '15 at 10:19
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77700/discussion-between-jernej-jerin-and-misha). – Jernej Jerin May 13 '15 at 10:23
As an alternative, you could also use `collectingAndThen(groupingBy(x -> x), m -> m.values()....);` – Alexis C. May 13 '15 at 11:21

Tagir Valeev · Answer 2 · 2015-11-23T12:42:49.287

7

You can define an abstract "library" method which combines two collectors into one:

static <T, A1, A2, R1, R2, R> Collector<T, ?, R> pairing(Collector<T, A1, R1> c1, 
        Collector<T, A2, R2> c2, BiFunction<R1, R2, R> finisher) {
    EnumSet<Characteristics> c = EnumSet.noneOf(Characteristics.class);
    c.addAll(c1.characteristics());
    c.retainAll(c2.characteristics());
    c.remove(Characteristics.IDENTITY_FINISH);
    return Collector.of(() -> new Object[] {c1.supplier().get(), c2.supplier().get()},
            (acc, v) -> {
                c1.accumulator().accept((A1)acc[0], v);
                c2.accumulator().accept((A2)acc[1], v);
            },
            (acc1, acc2) -> {
                acc1[0] = c1.combiner().apply((A1)acc1[0], (A1)acc2[0]);
                acc1[1] = c2.combiner().apply((A2)acc1[1], (A2)acc2[1]);
                return acc1;
            },
            acc -> {
                R1 r1 = c1.finisher().apply((A1)acc[0]);
                R2 r2 = c2.finisher().apply((A2)acc[1]);
                return finisher.apply(r1, r2);
            }, c.toArray(new Characteristics[c.size()]));
}

After that the actual operation may look like this:

Map<Route, Long> result = routes.stream()
        .collect(Collectors.groupingBy(Function.identity(),
            pairing(Collectors.maxBy(Comparator.comparingLong(Route::getLastUpdated)), 
                    Collectors.counting(), 
                    (route, count) -> new AbstractMap.SimpleEntry<>(route.get(), count))
            ))
        .values().stream().collect(Collectors.toMap(e -> e.getKey(), e -> e.getValue()));

Update: such collector is available in my StreamEx library: MoreCollectors.pairing(). Also similar collector is implemented in jOOL library, so you can use Tuple.collectors instead of pairing.

edited Nov 23 '15 at 12:42

answered May 13 '15 at 09:39

Tagir Valeev

97,161
19
222
334

Thank you for the solution but I am looking for something more elegant, like @Misha answer. – Jernej Jerin May 13 '15 at 09:54
1

I separated the "library" code from business logic. Library code still scary, but business code now looks much better. – Tagir Valeev May 13 '15 at 10:05
1

And now the whole thing in a type safe way , please ;-) Seriously, such a solution should be in `Collectors`… – Holger May 13 '15 at 10:06
1

@Holger, let's count how many times `@SuppressWarnings("unchecked")` appears in `java.util.stream.Collectors` :-) That's ok for low-level library code. Business part has no warnings. – Tagir Valeev May 13 '15 at 10:08
3

Well, replacing the `Object[]` with a `Pair` or tuple type would solve it. Now it’s the JRE developer’s turn… – Holger May 13 '15 at 10:13
2

@TagirValeev Plus one for effort and for using six generic type variables. :-) – Stuart Marks May 13 '15 at 22:08
2

@Holger We probably don't do anything like this until actual value types come along after Java 9. – Stuart Marks May 13 '15 at 22:09
2

By the way just discovered that similar collector appears in [jOOL](https://github.com/jOOQ/jOOL/blob/master/src/main/java/org/jooq/lambda/tuple/Tuple.java) library. They can merge up to 8 collectors using 17 generic parameters! – Tagir Valeev May 14 '15 at 02:31
3

Wasn't that an April Fools joke? Oh wait, that was this: http://blog.jooq.org/2015/04/01/dont-be-fooled-by-generics-and-backwards-compatibility-use-generic-generic-types/ – Stuart Marks May 14 '15 at 05:15

score 2 · Answer 3 · answered May 13 '15 at 09:37

Changed equals and hashcode to be dependent only on start cell and end cell.

@Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        Cell cell = (Cell) o;

        if (a != cell.a) return false;
        if (b != cell.b) return false;

        return true;
    }

    @Override
    public int hashCode() {
        int result = a;
        result = 31 * result + b;
        return result;
    }

My solution looks like this:

Map<Route, Long> routesCounted = routes.stream()
            .sorted((r1,r2)-> (int)(r2.lastUpdated - r1.lastUpdated))
            .collect(Collectors.groupingBy(gr -> gr, Collectors.counting()));

Of course casting to int should be replaced with something more appropriated.

Note that javadoc for `groupingBy` doesn't specifically guarantee that this will work. — Misha, May 13 '15 at 09:41
And it will not actually work for parallel stream. But the idea looks interesting. — Tagir Valeev, May 13 '15 at 09:43

score 2 · Answer 4 · edited May 23 '17 at 12:18

In principle it seems like this ought to be doable in one pass. The usual wrinkle is that this requires an ad-hoc tuple or pair, in this case with a Route and a count. Since Java lacks these, we end up using an Object array of length 2 (as shown in Tagir Valeev's answer), or AbstractMap.SimpleImmutableEntry, or a hypothetical Pair<A,B> class.

The alternative is to write a little value class that holds a Route and a count. Of course there's some pain in doing this, but in this case I think it pays off because it provides a place to put the combining logic. That in turn simplifies the stream operation.

Here's the value class containing a Route and a count:

class RouteCount {
    final Route route;
    final long count;

    private RouteCount(Route r, long c) {
        this.route = r;
        count = c;
    }

    public static RouteCount fromRoute(Route r) {
        return new RouteCount(r, 1L);
    }

    public static RouteCount combine(RouteCount rc1, RouteCount rc2) {
        Route recent;
        if (rc1.route.getLastUpdated() > rc2.route.getLastUpdated()) {
            recent = rc1.route;
        } else {
            recent = rc2.route;
        }
        return new RouteCount(recent, rc1.count + rc2.count);
    }
}

Pretty straightforward, but notice the combine method. It combines two RouteCount values by choosing the Route that's been updated more recently and using the sum of the counts. Now that we have this value class, we can write a one-pass stream to get the result we want:

    Map<Route, RouteCount> counted = routes.stream()
        .collect(groupingBy(route -> route,
                    collectingAndThen(
                        mapping(RouteCount::fromRoute, reducing(RouteCount::combine)),
                        Optional::get)));

Like other answers, this groups the routes into equivalence classes based on the starting and ending cell. The actual Route instance used as the key isn't significant; it's just a representative of its class. The value will be a single RouteCount that contains the Route instance that has been updated most recently, along with the count of equivalent Route instances.

The way this works is that each Route instance that has the same start and end cells is then fed into the downstream collector of groupingBy. This mapping collector maps the Route instance into a RouteCount instance, then passes it to a reducing collector that reduces the instances using the combining logic described above. The and-then portion of collectingAndThen extracts the value from the Optional<RouteCount> that the reducing collector produces.

(Normally a bare get is dangerous, but we don't get to this collector at all unless there's at least one value available. So get is safe in this case.)

Great answer and well explained. I really like your solution. If I understand correctly this means that your solution will run in O(n) time while the @Misha solution is 2 * O(n), right? — Jernej Jerin, May 14 '15 at 06:09
@JernejJerin Strictly speaking, from a computer science point of view, 2*O(n) is the same as O(n). But I think you're referring to my approach as making one pass over the data vs. Misha's which makes two passes. That's true, but it doesn't necessarily follow that my approach is 2x the speed of Misha's; a two-pass approach could be the same speed as a one-pass approach that does twice the amount of work per element. I don't know if my approach in fact does 2x the work per element, but it does seem to do more per-element than each of Misha's passes. The overall amount of work seems similar. — Stuart Marks, May 14 '15 at 17:07
@JernejJerin I suspect though that the real costs would be in memory allocation, memory consumption, and GC pressure. Whether this is significant depends on a whole bunch of things. If the stream source isn't stored in memory, e.g., it's coming from a database or somewhere, the one-pass approach could have a big advantage in memory savings. But only if there are a lot of duplicate routes. If there are no duplicates, it ends up storing everything. Anyway, I think you can see the nuances. To find out which is better you'd have to benchmark. — Stuart Marks, May 14 '15 at 17:10
@StuartMarks You can also use the slightly shorter `toMap(r->r, RouteCount::fromRoute, RouteCount::combine)` instead of groupingBy+reducing — Misha, May 20 '15 at 08:28

Grouping by object value, counting and then setting group key by maximum object attribute

4 Answers4

Linked