21

I have a list of objects with many duplicated and some fields that need to be merged. I want to reduce this down to a list of unique objects using only Java 8 Streams (I know how to do this via old-skool means but this is an experiment.)

This is what I have right now. I don't really like this because the map-building seems extraneous and the values() collection is a view of the backing map, and you need to wrap it in a new ArrayList<>(...) to get a more specific collection. Is there a better approach, perhaps using the more general reduction operations?

    @Test
public void reduce() {
    Collection<Foo> foos = Stream.of("foo", "bar", "baz")
                     .flatMap(this::getfoos)
                     .collect(Collectors.toMap(f -> f.name, f -> f, (l, r) -> {
                         l.ids.addAll(r.ids);
                         return l;
                     })).values();

    assertEquals(3, foos.size());
    foos.forEach(f -> assertEquals(10, f.ids.size()));
}

private Stream<Foo> getfoos(String n) {
    return IntStream.range(0,10).mapToObj(i -> new Foo(n, i));
}

public static class Foo {
    private String name;
    private List<Integer> ids = new ArrayList<>();

    public Foo(String n, int i) {
        name = n;
        ids.add(i);
    }
}
ryber
  • 4,537
  • 2
  • 26
  • 50
  • 2
    Is it possible to implement this "old skool" (conventionally, without lambda/streams) without using an intermediate map? I think that, since duplicates potentially can occur anywhere in the input, they all have to be buffered up somewhere until all the input has been processed. – Stuart Marks Sep 21 '15 at 20:58

4 Answers4

18

If you break the grouping and reducing steps up, you can get something cleaner:

Stream<Foo> input = Stream.of("foo", "bar", "baz").flatMap(this::getfoos);

Map<String, Optional<Foo>> collect = input.collect(Collectors.groupingBy(f -> f.name, Collectors.reducing(Foo::merge)));

Collection<Optional<Foo>> collected = collect.values();

This assumes a few convenience methods in your Foo class:

public Foo(String n, List<Integer> ids) {
    this.name = n;
    this.ids.addAll(ids);
}

public static Foo merge(Foo src, Foo dest) {
    List<Integer> merged = new ArrayList<>();
    merged.addAll(src.ids);
    merged.addAll(dest.ids);
    return new Foo(src.name, merged);
}
Brian Kent
  • 3,754
  • 1
  • 26
  • 31
  • 1
    It amounts to almost the same thing - only you create a lot of new `Foo` objects on the way, and your list is a list of `Optional` rather than a list of `Foo`, which is not exactly clean. – RealSkeptic Sep 21 '15 at 17:47
  • Couldn't you just add the ids from the dest foo to the src rather than making a new Foo? – ryber Sep 21 '15 at 17:58
  • 3
    @ryber sure, but in a real world scenario that could easily create unexpected issues, especially if your reduction is running in parallel. I would recommend reducing mutability in your streaming operations. See: https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#Reduction. – Brian Kent Sep 21 '15 at 18:38
  • 1
    @RealSkeptic, for an arbitrary data model, you aren't necessarily going to have a logical "identity" type for the `reduce` so the `Optional` in the result is to be expected. – Brian Kent Sep 21 '15 at 18:41
  • 1
    Java 9 has an additional unwrap method that unwraps a stream of options to a stream of present Ts. I have a similar hand rolled method in my codebase. – ryber Sep 21 '15 at 18:54
  • 2
    Reduction in general without an identity value will return an empty `Optional` if the input is empty. But reduction within the context of `groupingBy` should never result in an empty `Optional` because the map entry will only get created if there's value present. So this is just noise in the API. For this example, the OP's three-arg `toMap()` call is probably preferable to groupingBy/reducing. – Stuart Marks Sep 21 '15 at 21:03
  • @ryber Are you referring to `streamOfOptionals.flatMap(Optional::stream)`? – M. Justin May 29 '20 at 20:37
6

As already pointed out in the comments, a map is a very natural thing to use when you want to identify unique objects. If all you needed to do was find the unique objects, you could use the Stream::distinct method. This method hides the fact that there is a map involved, but apparently it does use a map internally, as hinted by this question that shows you should implement a hashCode method or distinct may not behave correctly.

In the case of the distinct method, where no merging is necessary, it is possible to return some of the results before all of the input has been processed. In your case, unless you can make additional assumptions about the input that haven't been mentioned in the question, you do need to finish processing all of the input before you return any results. Thus this answer does use a map.

It is easy enough to use streams to process the values of the map and turn it back into an ArrayList, though. I show that in this answer, as well as providing a way to avoid the appearance of an Optional<Foo>, which shows up in one of the other answers.

public void reduce() {
    ArrayList<Foo> foos = Stream.of("foo", "bar", "baz").flatMap(this::getfoos)
            .collect(Collectors.collectingAndThen(Collectors.groupingBy(f -> f.name,
            Collectors.reducing(Foo.identity(), Foo::merge)),
            map -> map.values().stream().
                collect(Collectors.toCollection(ArrayList::new))));

    assertEquals(3, foos.size());
    foos.forEach(f -> assertEquals(10, f.ids.size()));
}

private Stream<Foo> getfoos(String n) {
    return IntStream.range(0, 10).mapToObj(i -> new Foo(n, i));
}

public static class Foo {
    private String name;
    private List<Integer> ids = new ArrayList<>();

    private static final Foo BASE_FOO = new Foo("", 0);

    public static Foo identity() {
        return BASE_FOO;
    }

    // use only if side effects to the argument objects are okay
    public static Foo merge(Foo fooOne, Foo fooTwo) {
        if (fooOne == BASE_FOO) {
            return fooTwo;
        } else if (fooTwo == BASE_FOO) {
            return fooOne;
        }
        fooOne.ids.addAll(fooTwo.ids);
        return fooOne;
    }

    public Foo(String n, int i) {
        name = n;
        ids.add(i);
    }
}
Community
  • 1
  • 1
Evan VanderZee
  • 797
  • 6
  • 10
  • 1
    Why all this `map.values().stream().collect(blahblah)`? Good old `map -> new ArrayList<>(map.values())` would be simpler and faster. – Tagir Valeev Sep 22 '15 at 05:52
  • @Tagir Valeev: if the only operations applied to the result are `size()` and `forEach()`, there is no reason to copy the `map.values()` collection into a new list at all. – Holger Sep 22 '15 at 08:33
2

If the input elements are supplied in the random order, then having intermediate map is probably the best solution. However if you know in advance that all the foos with the same name are adjacent (this condition is actually met in your test), the algorithm can be greatly simplified: you just need to compare the current element with the previous one and merge them if the name is the same.

Unfortunately there's no Stream API method which would allow you do to such thing easily and effectively. One possible solution is to write custom collector like this:

public static List<Foo> withCollector(Stream<Foo> stream) {
    return stream.collect(Collector.<Foo, List<Foo>>of(ArrayList::new,
             (list, t) -> {
                 Foo f;
                 if(list.isEmpty() || !(f = list.get(list.size()-1)).name.equals(t.name))
                     list.add(t);
                 else
                     f.ids.addAll(t.ids);
             },
             (l1, l2) -> {
                 if(l1.isEmpty())
                     return l2;
                 if(l2.isEmpty())
                     return l1;
                 if(l1.get(l1.size()-1).name.equals(l2.get(0).name)) {
                     l1.get(l1.size()-1).ids.addAll(l2.get(0).ids);
                     l1.addAll(l2.subList(1, l2.size()));
                 } else {
                     l1.addAll(l2);
                 }
                 return l1;
             }));
}

My tests show that this collector is always faster than collecting to map (up to 2x depending on average number of duplicate names), both in sequential and parallel mode.

Another approach is to use my StreamEx library which provides a bunch of "partial reduction" methods including collapse:

public static List<Foo> withStreamEx(Stream<Foo> stream) {
    return StreamEx.of(stream)
            .collapse((l, r) -> l.name.equals(r.name), (l, r) -> {
                l.ids.addAll(r.ids);
                return l;
            }).toList();
}

This method accepts two arguments: a BiPredicate which is applied for two adjacent elements and should return true if elements should be merged and the BinaryOperator which performs merging. This solution is a little bit slower in sequential mode than the custom collector (in parallel the results are very similar), but it's still significantly faster than toMap solution and it's simpler and somewhat more flexible as collapse is an intermediate operation, so you can collect in another way.

Again both these solutions work only if foos with the same name are known to be adjacent. It's a bad idea to sort the input stream by foo name, then using these solutions, because the sorting will drastically reduce the performance making it slower than toMap solution.

Tagir Valeev
  • 97,161
  • 19
  • 222
  • 334
1

As already pointed out by others, an intermediate Map is unavoidable, as that’s the way of finding the objects to merge. Further, you should not modify source data during reduction.

Nevertheless, you can achieve both without creating multiple Foo instances:

List<Foo> foos = Stream.of("foo", "bar", "baz")
                 .flatMap(n->IntStream.range(0,10).mapToObj(i -> new Foo(n, i)))

                 .collect(collectingAndThen(groupingBy(f -> f.name),
                    m->m.entrySet().stream().map(e->new Foo(e.getKey(),
                       e.getValue().stream().flatMap(f->f.ids.stream()).collect(toList())))
                    .collect(toList())));

This assumes that you add a constructor

    public Foo(String n, List<Integer> l) {
        name = n;
        ids=l;
    }

to your Foo class, as it should have if Foo is really supposed to be capable of holding a list of IDs. As a side note, having a type which serves as single item as well as a container for merged results seems unnatural to me. This is exactly why to code turns out to be so complicated.

If the source items had a single id, using something like groupingBy(f -> f.name, mapping(f -> id, toList()), followed by mapping the entries of (String, List<Integer>) to the merged items was sufficient.

Since this is not the case and Java 8 lacks the flatMapping collector, the flatmapping step is moved to the second step, making it look much more complicated.

But in both cases, the second step is not obsolete as it is where the result items are actually created and converting the map to the desired list type comes for free.

Holger
  • 285,553
  • 42
  • 434
  • 765
  • Immutable objects are certainly good, though it should be noted that current solution is about twice slower than OP's code. Using `flatMapping` collector it would likely be better... – Tagir Valeev Sep 22 '15 at 09:33
  • 1
    @Tagir Valeev: in this case, it’s not about the objects being immutable or not. It’s just about that reduction should not modify the source objects. I think, you can imagine how this can backfire if source objects are still being used… – Holger Sep 22 '15 at 09:37