
I have to write some code that adds the content of a Java 8 Stream to a List multiple times, and I'm having trouble figuring out the best way to do it. Based on what I read on SO (mainly this question: How to add elements of a Java8 stream into an existing List) and elsewhere, I've narrowed it down to the following options:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Accumulator<S, T> {

    private final Function<S, T> transformation;
    private final List<T> internalList = new ArrayList<T>();

    public Accumulator(Function<S, T> transformation) {
        this.transformation = transformation;
    }

    public void option1(List<S> newBatch) {
        internalList.addAll(newBatch.stream().map(transformation).collect(Collectors.toList()));
    }

    public void option2(List<S> newBatch) {
        newBatch.stream().map(transformation).forEach(internalList::add);
    }
}

The idea is that these methods would be called multiple times on the same instance of Accumulator. The choice is between collecting into an intermediate list and calling Collection.addAll() once outside of the stream, or calling Collection.add() from the stream for each element.

I would tend to prefer option 2, which is more in the spirit of functional programming and avoids creating an intermediate list. However, there might be benefits to calling addAll() instead of calling add() n times when n is large.

Is one of the two options significantly better than the other?

EDIT: JB Nizet has a very cool answer that delays the transformation until all batches have been added. In my case, it is required that the transformation is performed straight-away.

PS: In my example code, I've used transformation as a placeholder for whatever operations need to be performed on the stream.
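
For illustration, a hypothetical call sequence (Integer::parseInt stands in for the real transformation, and java.util.Arrays is assumed to be imported):

// The same Accumulator instance receives several batches over time.
Accumulator<String, Integer> acc = new Accumulator<>(Integer::parseInt);
acc.option1(Arrays.asList("1", "2"));   // first batch, transformed right away
acc.option2(Arrays.asList("3"));        // later batch, appended to the same internal list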

ahelix
  • The disassembled bytecode (javap) might help you figure out –  Sep 14 '16 at 16:25
  • Don't do premature optimizations. Do whatever is cleaner, and only if you hit performance problems check this code with a profiler. – Jaroslaw Pawlak Sep 14 '16 at 16:28
  • I don't think there are any benefits to calling `addAll()`. – shmosel Sep 14 '16 at 16:34
  • Beware that if you do parallel streaming, the results will be different, unless you change `forEach()` to `forEachOrdered()`. – Andreas Sep 14 '16 at 16:41
  • Why is it required “that the transformation is performed straight-away”? If you care about the “spirit of functional programming”, it shouldn't matter when the transformation is performed. – Holger Sep 14 '16 at 16:55
  • @Holger let's assume that I have a job that runs for 30 minutes and calls `optionX()` every 15 seconds. Depending on what processing needs to be performed, it might not be possible or desirable to perform everything at the end. – ahelix Sep 14 '16 at 17:05
  • As this is a long-running task, it might be worth using multiple threads, which changes the answer as it would need to be thread safe. I.e. I would favour option2, except that option1 is thread safe, possibly better. – Peter Lawrey Sep 14 '16 at 17:23

2 Answers


First of all, your second variant should be:

public void option2(List<S> newBatch) {
    newBatch.stream().map(transformation).forEachOrdered(internalList::add);
}

to be correct.
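
To illustrate the difference, here is a hypothetical snippet (not from the original answer, assuming the usual java.util and java.util.stream imports): on a parallel stream, forEach makes no ordering guarantee, while forEachOrdered preserves the encounter order.

// forEach on a parallel stream may invoke the action in any order and from
// several threads (hence the synchronized list for the demo itself);
// forEachOrdered processes the elements in encounter order.
List<Integer> viaForEach = Collections.synchronizedList(new ArrayList<>());
List<Integer> viaForEachOrdered = new ArrayList<>();

IntStream.rangeClosed(1, 8).boxed().parallel().forEach(viaForEach::add);
IntStream.rangeClosed(1, 8).boxed().parallel().forEachOrdered(viaForEachOrdered::add);

System.out.println(viaForEach);        // some permutation, e.g. [5, 6, 7, 8, 1, 2, 3, 4]
System.out.println(viaForEachOrdered); // always [1, 2, 3, 4, 5, 6, 7, 8]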

Besides that, the advantage of addAll in

public void option1(List<S> newBatch) {
    internalList.addAll(newBatch.stream()
        .map(transformation).collect(Collectors.toList()));
}

is moot, as the Collector API does not allow the Stream to provide hints about the expected size to the Collector, and requires the Stream to evaluate the accumulator function for every element, which is nothing other than ArrayList::add in the current implementation.

So before this approach could get any benefit from addAll, it has already filled an ArrayList by repeatedly calling add on it, including potential capacity-increase operations. So you can stay with option2 without regret.
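
For reference, a simplified sketch of what Collectors.toList() boils down to in the OpenJDK 8 sources (the method name toListSketch is made up; the real collector is built from the same supplier, accumulator and combiner):

// Simplified sketch of Collectors.toList(): the supplier gets no size hint,
// and the accumulator is one List.add call per element.
static <T> Collector<T, ?, List<T>> toListSketch() {
    return Collector.of(
            ArrayList::new,                                        // supplier
            List::add,                                             // accumulator
            (left, right) -> { left.addAll(right); return left; }  // combiner (parallel case)
    );
}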

An alternative is to use a stream builder for temporary collections:

public class Accumulator<S, T> {
    private final Function<S, T> transformation;
    private final Stream.Builder<T> internal = Stream.builder();

    public Accumulator(Function<S, T> transformation) {
        this.transformation = transformation;
    }

    public void addBatch(List<S> newBatch) {
        newBatch.stream().map(transformation).forEachOrdered(internal);
    }

    public List<T> finish() {
        return internal.build().collect(Collectors.toList());
    }
}
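
A hypothetical call sequence (the String/Integer types are made up for illustration; Stream.Builder implements Consumer, which is why it can be passed to forEachOrdered directly):

// Each batch is transformed immediately; only the final List is materialized in finish().
Accumulator<String, Integer> lengths = new Accumulator<>(String::length);
lengths.addBatch(Arrays.asList("a", "bb"));
lengths.addBatch(Arrays.asList("ccc", "dddd"));
List<Integer> result = lengths.finish();   // [1, 2, 3, 4]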

The stream builder uses a spined buffer which does not require copying the contents when increasing its capacity, but the solution still suffers from the fact that the final collection step involves filling an ArrayList without an appropriate initial capacity (in the current implementation).

With the current implementation, it’s far more efficient to implement the finishing step as

public List<T> finish() {
    return Arrays.asList(internal.build().toArray(…));
}

But this requires either an IntFunction<T[]> provided by the caller (as we can't create a generic array ourselves), or an unchecked operation (pretending an Object[] to be a T[], which would be OK here, but is still a nasty unchecked operation).
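
For completeness, a sketch of the caller-supplied variant (this finish overload and its generator parameter are an assumption, not part of the original class; it needs java.util.Arrays and java.util.function.IntFunction):

// The caller passes the array constructor, e.g. finish(Integer[]::new),
// so no unchecked cast is needed.
public List<T> finish(IntFunction<T[]> generator) {
    return Arrays.asList(internal.build().toArray(generator));
}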

Starting with JDK 16, you can use

public List<T> finish() {
    return internal.build().toList();
}

which returns an immutable list without the difficulties with Generic types, while having similar performance characteristics as the Arrays.asList(internal.build().toArray(…)) approach (in typical implementations).

Holger

The best solution would be a third one, avoiding that internal list completely. Just let the stream create the final list for you:

Assuming you have a List<List<S>> containing your N batches, on which the same transformation must be applied, you would do:

List<T> result = 
    batches.stream()
           .flatMap(batch -> batch.stream())
           .map(transformation)
           .collect(Collectors.toList());

JB Nizet
  • This is the best option, as this will create a fixed-size list. – ranjeet Sep 14 '16 at 16:43
  • Good answer, but it requires that you can defer the processing of earlier batches until all batches are ready. That may not always be the case. – Andreas Sep 14 '16 at 16:43
  • Good answer, I wouldn't have thought of that, although in my case I'm trying not to defer the processing, as @Andreas mentioned. I'm editing the question accordingly. – ahelix Sep 14 '16 at 16:47
  • @rana_stack You think `Collectors.toList` creates a presized list? It's far from doing that. Have a look at the [code](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/java/util/stream/Collectors.java#Collectors.toList%28%29) or Holger's answer which describes what's actually happening. – Stefan Zobel Sep 14 '16 at 18:04