175

When would you use collect() vs reduce()? Does anyone have good, concrete examples of when it's definitely better to go one way or the other?

Javadoc mentions that collect() is a mutable reduction.

Given that it's a mutable reduction, I assume it requires synchronization (internally) which, in turn, can be detrimental to performance. Presumably reduce() is more readily parallelizable at the cost of having to create a new data structure for return after every step in the reduce.

The above statements are guesswork however and I'd love an expert to chime in here.

Zaki
jimhooker2002
  • The rest of the page you linked to explains it: *As with reduce(), a benefit of expressing collect in this abstract way is that it is directly amenable to parallelization: we can accumulate partial results in parallel and then combine them, so long as the accumulation and combining functions satisfy the appropriate requirements.* – JB Nizet Mar 22 '14 at 11:57
  • Also see "Streams in Java 8: Reduce vs. Collect" by Angelika Langer - https://www.youtube.com/watch?v=oWlWEKNM5Aw – MasterJoe Mar 07 '20 at 00:19

8 Answers

141

reduce is a "fold" operation, it applies a binary operator to each element in the stream where the first argument to the operator is the return value of the previous application and the second argument is the current stream element.

collect is an aggregation operation where a "collection" is created and each element is "added" to that collection. Collections in different parts of the stream are then added together.

The document you linked gives the reason for having two different approaches:

If we wanted to take a stream of strings and concatenate them into a single long string, we could achieve this with ordinary reduction:

 String concatenated = strings.reduce("", String::concat)  

We would get the desired result, and it would even work in parallel. However, we might not be happy about the performance! Such an implementation would do a great deal of string copying, and the run time would be O(n^2) in the number of characters. A more performant approach would be to accumulate the results into a StringBuilder, which is a mutable container for accumulating strings. We can use the same technique to parallelize mutable reduction as we do with ordinary reduction.

So the point is that the parallelisation is the same in both cases, but in the reduce case we apply the function to the stream elements themselves, while in the collect case we apply it to a mutable container.
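For illustration, here is a minimal sketch of the StringBuilder approach the quote alludes to, assuming strings is a Stream<String> as above (the variable names are just for illustration):

Stream<String> strings = Stream.of("a", "b", "c");

// Mutable reduction: each parallel split gets its own StringBuilder from the
// supplier, the accumulator appends into that builder in place, and the
// combiner merges partial builders, which avoids the O(n^2) copying of
// repeated String::concat.
String concatenated = strings.collect(
        StringBuilder::new,     // supplier: a fresh container per split
        StringBuilder::append,  // accumulator: (sb, s) -> sb.append(s)
        StringBuilder::append)  // combiner: (sb1, sb2) -> sb1.append(sb2)
    .toString();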

Boris the Spider
  • If this is the case for collect: " A more performant approach would be to accumulate the results into a StringBuilder" then why would we ever use reduce? – jimhooker2002 Mar 23 '14 at 09:30
  • @Jimhooker2002 reread it. If you are, say, calculating the product then the reduction function can simply be applied to the split streams in parallel and then combined together at the end. The process of reducing always results in the same type as the stream. Collecting is used when you want to collect the results into a mutable container, i.e. when the result is a _different_ type to the stream. This has the advantage that a **single instance** of the container can be used for each split stream but the disadvantage that the containers need to be combined at the end. – Boris the Spider Mar 23 '14 at 09:35
  • @jimhooker2002 in the product example, `int` is **immutable** so you cannot readily use a collect operation. You could do a dirty hack like use an `AtomicInteger` or some custom `IntWrapper` but why would you? A fold operation is simply different to a collecting operation. – Boris the Spider Mar 23 '14 at 09:38
  • There is also another `reduce` method, where you can return objects of type different from elements of the stream. – Konstantin Milyutin Oct 14 '14 at 10:25
  • @damluar yes, but it assumes that the other type is immutable. – Boris the Spider Oct 14 '14 at 12:06
  • @BoristheSpider, what will happen when we use `reduce` with mutable object like `StringBuilder`? I assume the problem can arise only for parallel streams in this case or? – Konstantin Milyutin Oct 14 '14 at 12:43
  • How can the parallelism be the same in both cases if the `collect()` uses a mutable container that can't be shared among threads? – Dmitry Minkovsky Jul 27 '15 at 22:29
  • It can be shared among threads because copies of objects will be created by the framework. – Sandro Aug 02 '16 at 17:23
  • How will we be using StringBuilder with collect ? – Pratik Singhal Sep 25 '17 at 06:21
  • @PratikSinghal I don't quite understand your question. See [`StringJoiner`](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/java/util/StringJoiner.java) and [`Collectors.joining`](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#joining--). – Boris the Spider Sep 25 '17 at 06:32
  • One more case where you would use collect instead of reduce: when the reduce operation involves adding elements to a collection, every time your accumulator function processes an element it creates a new collection that includes the element, which is inefficient. – raghu May 12 '18 at 09:50
  • @raghu isn't that exactly what the answer describes? "_aggregation operation where a "collection" is created and each element is "added" to that collection_" – Boris the Spider May 12 '18 at 09:58
  • @BoristheSpider Oh yes, missed it. – raghu May 12 '18 at 10:01
  • The first distinction is wrong: the reduce method can return a value whose type differs from the stream element's type. – qartal Feb 15 '19 at 19:57
  • *`collection` is an aggregation operation*, `collection` should be `collect`. – Jason Law Jan 30 '20 at 03:40
  • @JasonLaw how has no one spotted that in all these years? Thanks! – Boris the Spider Jan 30 '20 at 06:45
  • @BoristheSpider - Does String concatenation with collect() use a StringBuilder behind the scenes ? – MasterJoe Mar 07 '20 at 01:01
  • @BoristheSpider, will you please help me understand how O(n^2) is calculated here: String concatenated = strings.reduce("", String::concat)? I wrote a sample program using a custom accumulator and got an O(n) result. – Knowledge_seeker Mar 11 '21 at 18:13
  • @Knowledge_seeker https://www.joelonsoftware.com/2001/12/11/back-to-basics/ – Boris the Spider Mar 11 '21 at 22:01
55

The reason is simply that:

  • collect() can only work with mutable result objects.
  • reduce() is designed to work with immutable result objects.

"reduce() with immutable" example

public class Employee {
  private Integer salary;
  public Employee(String aSalary){
    this.salary = Integer.valueOf(aSalary);
  }
  public Integer getSalary(){
    return this.salary;
  }
}

@Test
public void testReduceWithImmutable(){
  List<Employee> list = new LinkedList<>();
  list.add(new Employee("1"));
  list.add(new Employee("2"));
  list.add(new Employee("3"));

  Integer sum = list
  .stream()
  .map(Employee::getSalary)
  .reduce(0, (Integer a, Integer b) -> Integer.sum(a, b));

  assertEquals(Integer.valueOf(6), sum);
}

"collect() with mutable" example

For example, if you would like to calculate a sum manually using collect(), it cannot work with an immutable type such as BigDecimal, but only with a mutable one such as MutableInt from org.apache.commons.lang.mutable. See:

public class Employee {
  private MutableInt salary;
  public Employee(String aSalary){
    this.salary = new MutableInt(aSalary);
  }
  public MutableInt getSalary(){
    return this.salary;
  }
}

@Test
public void testCollectWithMutable(){
  List<Employee> list = new LinkedList<>();
  list.add(new Employee("1"));
  list.add(new Employee("2"));

  MutableInt sum = list.stream().collect(
      MutableInt::new,
      (MutableInt container, Employee employee) ->
          container.add(employee.getSalary().intValue()),
      MutableInt::add);
  assertEquals(new MutableInt(3), sum);
}

This works because the accumulator call container.add(employee.getSalary().intValue()) is not supposed to return a new object with the result, but to change the state of the mutable container of type MutableInt.

If you would like to use BigDecimal for the container instead, you could not use the collect() method, as container.add(employee.getSalary()) would not change the container, because BigDecimal is immutable. (Apart from this, BigDecimal::new would not work either, as BigDecimal has no no-argument constructor.)
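By contrast, an immutable BigDecimal fits reduce() naturally. A minimal sketch, assuming a hypothetical variant of Employee whose getSalary() returns a BigDecimal:

@Test
public void testReduceWithBigDecimal(){
  // assumes a hypothetical Employee storing its salary as new BigDecimal(aSalary)
  List<Employee> list = new LinkedList<>();
  list.add(new Employee("1"));
  list.add(new Employee("2"));

  // every step of the reduction returns a brand-new BigDecimal;
  // no container is ever mutated
  BigDecimal sum = list.stream()
    .map(Employee::getSalary)
    .reduce(BigDecimal.ZERO, BigDecimal::add);

  assertEquals(new BigDecimal(3), sum);
}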

Sandro
  • @Sandro - I am confused. Why do you say that collect() works only with mutable objects ? I used it to concatenate strings. String allNames = employees.stream() .map(Employee::getNameString) .collect(Collectors.joining(", ")) .toString(); – MasterJoe Mar 07 '20 at 00:24
  • @MasterJoe2 It's simple. In short - the implementation still uses the `StringBuilder` which is mutable. See: http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/stream/Collectors.java#l263 – Sandro Mar 08 '20 at 12:37
39

The normal reduction is meant to combine two immutable values such as int, double, etc. and produce a new one; it’s an immutable reduction. In contrast, the collect method is designed to mutate a container to accumulate the result it’s supposed to produce.

To illustrate the problem, let's suppose you want to achieve Collectors.toList() using a simple reduction like

List<Integer> numbers = stream.reduce(
        new ArrayList<Integer>(),
        (List<Integer> l, Integer e) -> {
            l.add(e);
            return l;
        },
        (List<Integer> l1, List<Integer> l2) -> {
            l1.addAll(l2);
            return l1;
        });

This is the equivalent of Collectors.toList(). However, in this case you mutate the List<Integer>. As we know, ArrayList is not thread-safe, and it is not safe to add or remove values from it while iterating, so when this runs in parallel you may get a ConcurrentModificationException, an ArrayIndexOutOfBoundsException, or any other kind of exception (or simply a wrong result) when the accumulator adds integers to the list or when the combiner tries to merge the partial lists. To make this thread-safe you would need to pass a new list each time, which would impair performance.

In contrast, Collectors.toList() works in a similar fashion, but it guarantees thread safety when you accumulate the values into the list. From the documentation for the collect method:

Performs a mutable reduction operation on the elements of this stream using a Collector. If the stream is parallel, and the Collector is concurrent, and either the stream is unordered or the collector is unordered, then a concurrent reduction will be performed. When executed in parallel, multiple intermediate results may be instantiated, populated, and merged so as to maintain isolation of mutable data structures. Therefore, even when executed in parallel with non-thread-safe data structures (such as ArrayList), no additional synchronization is needed for a parallel reduction.

So to answer your question:

When would you use collect() vs reduce()?

If you are reducing to an immutable value such as an int, a double, or a String, then normal reduction works just fine. However, if you have to reduce your values into, say, a List (a mutable data structure), then you need to use mutable reduction with the collect method, as in the sketch below.
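As a minimal sketch of that safe mutable reduction (the numbers are just sample data), the collect version below is safe even in parallel, because the framework gives each split its own ArrayList and merges the partial lists:

List<Integer> numbers = Stream.of(1, 2, 3, 4, 5)
        .parallel()
        .collect(Collectors.toList()); // one list per split, merged safely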

Jason Law
george
  • In the code snippet I think the problem is it will take the identity (in this case a single instance of an ArrayList) and assume it is "immutable" so they can start `x` threads, each "adding to the identity" then combining together. Good example. – rogerdpack Mar 02 '18 at 23:34
  • Why would we get a concurrent modification exception? Calling stream() just returns a sequential stream, which means it is processed by a single thread and the combiner function is never called. – amarnath harish Aug 24 '18 at 11:53
  • `public static void main(String[] args) { List l = new ArrayList<>(); l.add(1); l.add(10); l.add(3); l.add(-3); l.add(-4); List numbers = l.stream().reduce( new ArrayList(), (List l2, Integer e) -> { l2.add(e); return l2; }, (List l1, List l2) -> { l1.addAll(l2); return l1; });for(Integer i:numbers)System.out.println(i); } }` i tried and didnt get CCm exception – amarnath harish Aug 24 '18 at 11:58
  • @amarnathharish the problem occurs when you attempt to run it in parallel and multiple threads try to access the same list – george Apr 15 '19 at 09:17
14

Let the stream be a <- b <- c <- d

In reduction,

you will have ((a # b) # c) # d

where # is that interesting operation that you would like to do.

In collection,

your collector will have some kind of collecting structure K.

K consumes a. K then consumes b. K then consumes c. K then consumes d.

At the end, you ask K what the final result is.

K then gives it to you.
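A minimal sketch of the same picture in Java, taking # to be string concatenation and K to be the container behind Collectors.joining() (the sample data is just for illustration):

List<String> letters = Arrays.asList("a", "b", "c", "d");

// Reduction: ((a # b) # c) # d
String reduced = letters.stream().reduce("", (x, y) -> x + y);

// Collection: K consumes a, then b, then c, then d, then yields the result
String collected = letters.stream().collect(Collectors.joining());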

Yan Ng
3

They can be very different in their potential memory footprint at runtime. While collect() gathers all the data into a collection, reduce() explicitly asks you to specify how to reduce the data that made it through the stream.

For example, if you want to read some data from a file, process it, and put it into some database, you might end up with java stream code similar to this:

streamDataFromFile(file)
            .map(data -> processData(data))
            .map(result -> database.save(result))
            .collect(Collectors.toList());

In this case, we use collect() to force Java to stream the data through and save the results into the database. Without collect(), the data is never read and never stored.

This code happily generates a java.lang.OutOfMemoryError: Java heap space at runtime if the file is large enough or the heap small enough. The obvious reason is that it tries to pile all the data that made it through the stream (and that has, in fact, already been stored in the database) into the resulting collection, and this blows up the heap.

However, if you replace collect() with reduce() -- it won't be a problem anymore as the latter will reduce and discard all the data that made it through.

In the presented example, just replace collect() with a reduce:

.reduce(0L, (aLong, result) -> aLong, (aLong1, aLong2) -> aLong1);

You do not even need to make the calculation depend on the result, as Java is not a pure FP (functional programming) language and cannot optimize away data that is not used at the bottom of the stream, because of possible side effects.
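And if the saved results are not needed downstream at all, a side-effecting terminal operation is arguably cleaner than a dummy reduce; a sketch using the same hypothetical methods from the example above:

streamDataFromFile(file)
        .map(data -> processData(data))
        .forEach(result -> database.save(result)); // nothing is accumulated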

averasko
  • If you don't care about the results of your db save, you should use forEach... you don't need to use reduce. Unless this was for illustrative purposes. – DaveEdelstein Dec 08 '16 at 20:53
3

Here is a code example:

List<Integer> list = Arrays.asList(1,2,3,4,5,6,7);
int sum = list.stream().reduce((x,y) -> {
        System.out.println(String.format("x=%d,y=%d",x,y));
        return (x + y);
    }).get();

System.out.println(sum);

Here is the execution result:

x=1,y=2
x=3,y=3
x=6,y=4
x=10,y=5
x=15,y=6
x=21,y=7
28

The reduce function handles two parameters: the first is the previous return value in the stream, and the second is the current value in the stream. It sums the first and current values and uses the result as the first value in the next calculation.
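For completeness, the same fold can be seeded with an identity value; this two-argument form returns the sum directly instead of the Optional returned by the one-argument reduce above:

// 0 is the identity: it serves as the first x, so no get() is needed
int sum = list.stream().reduce(0, (x, y) -> x + y);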

JetQin
1

According to the docs:

The reducing() collectors are most useful when used in a multi-level reduction, downstream of groupingBy or partitioningBy. To perform a simple reduction on a stream, use Stream.reduce(BinaryOperator) instead.

So basically you'd use reducing() only when forced within a collect. Here's another example:

For example, given a stream of Person, to calculate the longest last name of residents in each city:

    Comparator<String> byLength = Comparator.comparing(String::length);
    Map<String, String> longestLastNameByCity
        = personList.stream().collect(groupingBy(Person::getCity,
            reducing("", Person::getLastName, BinaryOperator.maxBy(byLength))));

According to this tutorial, reduce is sometimes less efficient:

The reduce operation always returns a new value. However, the accumulator function also returns a new value every time it processes an element of a stream. Suppose that you want to reduce the elements of a stream to a more complex object, such as a collection. This might hinder the performance of your application. If your reduce operation involves adding elements to a collection, then every time your accumulator function processes an element, it creates a new collection that includes the element, which is inefficient. It would be more efficient for you to update an existing collection instead. You can do this with the Stream.collect method, which the next section describes...

So the identity is "re-used" in a reduce scenario, making it slightly more efficient to go with .reduce if possible.

rogerdpack
1

There is a very good reason to always prefer collect() over the reduce() method: using collect() is much more performant, as explained here:

Java 8 tutorial

A mutable reduction operation (such as Stream.collect()) collects the stream elements in a mutable result container (collection) as it processes them. Mutable reduction operations provide much improved performance when compared to an immutable reduction operation (such as Stream.reduce()).

This is due to the fact that the collection holding the result at each step of reduction is mutable for a Collector and can be used again in the next step.

Stream.reduce() operation, on the other hand, uses immutable result containers and as a result needs to instantiate a new instance of the container at every intermediate step of reduction, which degrades performance.
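A minimal sketch contrasting the two, with the word list as made-up sample data:

List<String> words = Arrays.asList("foo", "bar", "baz");

// immutable reduction: every step allocates a brand-new String
String viaReduce = words.stream().reduce("", String::concat);

// mutable reduction: a StringBuilder-backed container is reused at each step
String viaCollect = words.stream().collect(Collectors.joining());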

Alex Mi