3

So I have some code using Java 8 streams, and it works. It does exactly what I need it to do, and it's legible (a rarity for functional programming). Towards the end of a subroutine, the code runs over a List of a custom pair type:

// All names Hungarian-Notation-ized for SO reading
class AFooAndABarWalkIntoABar
{
    public int      foo_int;
    public BarClass bar_object;
    ....
}

List<AFooAndABarWalkIntoABar> results = ....;

The data here must be passed into other parts of the program as arrays, so they get copied out:

// extract either a foo or a bar from each "foo-and-bar" (fab)
int[] foo_array = results.stream()
    .mapToInt (fab -> fab.foo_int)
    .toArray();

BarClass[] bar_array = results.stream()
    .map (fab -> fab.bar_object)
    .toArray(BarClass[]::new);

And done. Now each array can go do its thing.

Except... that loop over the List twice bothers me in my soul. And if we ever need to track more information, they're likely going to add a third field, and then have to make a third pass to turn the 3-tuple into three arrays, etc. So I'm fooling around with trying to do it in a single pass.

Allocating the data structures is trivial, but maintaining an index for use by the Consumer seems hideous:

int[] foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

// the trick is providing a stateful iterator across the array:
// - can't just use 'int', it's not effectively final
// - an actual 'final int' would be hilariously wrong
// - "all problems can be solved with a level of indirection"
class Indirection { int iterating = 0; }
final Indirection sigh = new Indirection();
// equivalent possibility is
//    final int[] disgusting = new int[]{ 0 };
// and then access disgusting[0] inside the lambda
// wash your hands after typing that code

results.stream().forEach (fab -> {
    foo_array[sigh.iterating] = fab.foo_int;
    bar_array[sigh.iterating] = fab.bar_object;
    sigh.iterating++;
});

This produces identical arrays as the existing solution using multiple stream loops. And it does so in about half the time, go figure. But the iterator indirection tricks seem so unspeakably ugly, and of course preclude any possibility of populating the arrays in parallel.

Using a pair of ArrayList instances, created with appropriate capacity, would let the Consumer code simply call add for each instance, and no external iterator needed. But ArrayList's toArray(T[]) has to perform a copy of the storage array again, and in the int case there's boxing/unboxing on top of that.

(edit: The answers to the "possible duplicate" question all talk about only maintaining the indices in a stream, and using direct array indexing to get to the actual data during filter/map calls, along with a note that it doesn't really work if the data isn't accessible by direct index. While this question has a List and is "directly indexable" only from a viewpoint of "well, List#get exists, technically". If the results collection above is a LinkedList, for example, then calling an O(n) get N times with nonconsecutive index would be... bad.)

Are there other, better, possibilities that I'm missing? I thought a custom Collector might do it, but I can't figure out how to maintain the state there either and never even got as far as scratch code.

Ti Strga
  • 1,353
  • 18
  • 40
  • Possible duplicate of [Is there a concise way to iterate over a stream with indices in Java 8?](http://stackoverflow.com/questions/18552005/is-there-a-concise-way-to-iterate-over-a-stream-with-indices-in-java-8) – MikaelF Jan 30 '17 at 04:54
  • Actually, iterating over a trivial in-memory collection twice doesn’t bother me in any way. All solutions to this non-problem raise complexity and reduce performance. And the original code actually has a perfect road map to supporting collection of a third field into a third array. You even don’t have to change the existing code… – Holger Jan 30 '17 at 10:44
  • @Holger I appreciate that you've thought about the problem and responded, but at no point did I say that the `results` was "a trivial in-memory collection". It isn't. Even the hackish example solution included in the post mentioned that it runs in about half the time, so I don't know why you're asserting that "all solutions reduce performance" and that the question doesn't even qualify as a problem. – Ti Strga Jan 30 '17 at 18:50
  • Well, it pretends to be a `List`. If it’s iteration costs are that high, you would get better results fixing the design error that lies already before that, instead of finding a workaround for that Stream code. And no, `LinkedList` doesn’t have this performance penalty, when being iterated multiple times… – Holger Jan 30 '17 at 19:00
  • It's a `List` because that's what the creating third-party Collector says it is. We can't change that code. And looking at the source of `LinkedList#get` (and the `node()` accessor, I'm definitely seeing linear traversal from the nearest endpoint... The complaint about LinkedList is about `get` indexing, not traversal. Fortunately, what's currently being returned is not concretely LinkedList, but I can't assume it won't change. – Ti Strga Jan 30 '17 at 19:38

4 Answers4

6

As the size of stream is known, there is no reason of reinventing the wheel again. The simplest solution is usually the best one. The second approach you have shown is nearly there - just use AtomicIntegeras array index and you will achieve your goal - single pass over data, and possible parralel stream execution ( due to AtomicInteger).

SO

AtomicInteger index=new AtomicInteger()
results.parallelStream().forEach (fab -> {
    int idx=index.getAndIncrement();
    foo_array[idx] = fab.foo_int;
    bar_array[idx] = fab.bar_object;
});

Thread safe for parralel execution. One iteratio over whole collection

Antoniossss
  • 31,590
  • 6
  • 57
  • 99
  • I'm marking this as the answer due to its simplicity, but the caveat about non-deterministic ordering of the output arrays is a major caution to anybody reading this in the future. – Ti Strga Jan 31 '17 at 22:26
3

If your prerequisites are that both, iterating the list and accessing the list via an index, are expensive operations, there is no chance of getting a benefit from the parallel Stream processing. You can try to go with this answer, if you don’t need the result values in the original list order.

Otherwise, you can’t benefit from the parallel Stream processing as it requires the source to be able to efficiently split its contents into two halves, which implies either, random access or fast iteration. If the source has no customized spliterator, the default implementation will try to enable parallel processing via buffering elements into an array, which already implies iterating before the parallel processing even starts and having additional array storage costs where your sole operation is an array storage operation anyway.

When you accept that there is no benefit from parallel processing, you can stay with your sequential solution, but solve the ugliness of the counter by moving it into the Consumer. Since lambda expressions don’t support this, you can turn to the good old anonymous inner class:

int[]      foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

results.forEach(new Consumer<AFooAndABarWalkIntoABar>() {
    int index=0;
    public void accept(AFooAndABarWalkIntoABar t) {
        foo_array[index]=t.foo_int;
        bar_array[index]=t.bar_object;
        index++;
    }
});

Of course, there’s also the often-overlooked alternative of the good old for-loop:

int[]      foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

int index=0;
for(AFooAndABarWalkIntoABar t: results) {
    foo_array[index]=t.foo_int;
    bar_array[index]=t.bar_object;
    index++;
}

I wouldn’t be surprised, if this beats all other alternatives performance-wise for your scenario…

Community
  • 1
  • 1
Holger
  • 285,553
  • 42
  • 434
  • 765
  • Yeah, I have some anonymous inner class implementations of Consumer in unrelated places (not for indexing, but to capture other data unsupported by lambda). Until our upstream data provider does something smarter, I'll probably submit a patch to use a simple for loop like this instead. If Streams won't help us, there's no reason to use them! Alternatively, if they'll give me access to their code, I'll reimplement the incoming Stream to be smart about parallel operations downstream (where this code lives). – Ti Strga Jan 30 '17 at 19:43
0

A way to reuse an index in a stream is to wrap your lambda in an IntStream that is in charge of incrementing the index:

IntStream.range(0, results.size()).forEach(i -> {
    foo_array[i] = results.get(i).foo_i;
    bar_array[i] = results.get(i).bar_object;
});

With regards to Antoniossss's answer, using an IntStream seems like a slightly preferable option to using AtomicInteger:

  • It also works with parallel();
  • Two less local variable;
  • Leaves the Stream API in charge of parallel processing;
  • Two less lines of code.

EDIT: as Mikhail Prokhorov pointed out, calling the get method twice on implementations such as LinkedList will be slower than other solutions, given the O(n) complexity of their implementations of get. This can be fixed with:

AFooAndABarWalkIntoABar temp = results.get(i);
foo_array[i] = temp.foo_i;
bar_array[i] = temp.bar_object;
MikaelF
  • 3,518
  • 4
  • 20
  • 33
  • 1
    But depending on what a `results` is (for example, linked list), getting `i` out of it may involve a lot more iteration than using variant with `AtomicInteger` solution. – M. Prokhorov Jan 30 '17 at 15:53
  • Bubbling out `get(i)` so that it's a temp variable inside the lambda is better than calling it twice, but it's still calling a potentially O(n) `get` each time the lambda executes in an already-O(n) traversal. – Ti Strga Jan 30 '17 at 18:26
  • I can see how my wording was confusing, thanks for pointing that out. I edited my answer. – MikaelF Jan 30 '17 at 19:14
0

Java 12 adds a teeing collector which provides an approach to do this in one pass. Here is some example code using Apache Commons Pair class.

import org.apache.commons.lang3.tuple.Pair;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class Scratch {

    public static void main(String[] args) {
        final Stream<Pair<String, String>> pairs = Stream.of(
                Pair.of("foo1", "bar1"),
                Pair.of("foo2", "bar2"),
                Pair.of("foo3", "bar3")
        );

        final Pair<List<String>, List<String>> zipped = pairs
                .collect(Collectors.teeing(
                        Collectors.mapping(Pair::getLeft, Collectors.toList()),
                        Collectors.mapping(Pair::getRight, Collectors.toList()),
                        (lefts, rights) -> Pair.of(lefts, rights)
                        ));

        // Then get the arrays out
        String[] lefts = zipped.getLeft().toArray(String[]::new);
        String[] rights = zipped.getRight().toArray(String[]::new);

        System.out.println(Arrays.toString(lefts));
        System.out.println(Arrays.toString(rights));
    }
}

The output will be

[foo1, foo2, foo3]
[bar1, bar2, bar3]

It does not require the stream size to be known ahead of time.

James Mudd
  • 1,816
  • 1
  • 20
  • 25