
Assume we have a collection of objects that are identified by unique Strings, along with a class Tree that defines a hierarchy on them. That class is implemented using a Map from nodes (represented by their IDs) to Collections of their respective children's IDs.

class Tree {
  private Map<String, Collection<String>> edges;

  // ...

  public Stream<String> descendants(String node) {
    // To be defined.
  }
}

I would like to enable streaming a node's descendants. A simple solution is this:

private Stream<String> children(String node) {
    return edges.getOrDefault(node, Collections.emptyList()).stream();
}

public Stream<String> descendants(String node) {
    return Stream.concat(
        Stream.of(node),
        children(node).flatMap(this::descendants)
    );
}

Before continuing, I would like to make the following assertions about this solution. (Am I correct about these?)

  1. Walking the Stream returned from descendants consumes resources (time and memory) – relative to the size of the tree – in the same order of complexity as hand-coding the recursion would. In particular, the intermediate objects representing the iteration state (Streams, Spliterators, ...) form a stack and therefore the memory requirement at any given time is in the same order of complexity as the tree's depth.

  2. As I understand it, as soon as I perform a terminal operation on the Stream returned from descendants, the root-level call to flatMap will cause all contained Streams – one for each (recursive) call to descendants – to be realized immediately. Thus, the resulting Stream is only lazy on the first level of recursion, but not beyond. (Edited according to Tagir Valeev's answer.)

If I understood these points correctly, my question is this: How can I define descendants so that the resulting Stream is lazy?

I would like the solution to be as elegant as possible, in the sense that I prefer a solution which leaves the iteration state implicit. (To clarify what I mean by that: I know that I could write a Spliterator that walks the tree while maintaining an explicit stack of Spliterators on each level. I would like to avoid that.)

(Is there possibly a way in Java to formulate this as a producer-consumer workflow, like one could use in languages like Julia and Go?)
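
To make the producer-consumer idea concrete, here is a hypothetical sketch of what a Go-style channel could look like in Java, emulated with a SynchronousQueue (the ChannelDemo class, the walk helper, and the DONE sentinel are made up for illustration, not part of the code above):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: emulating a Go-style channel with a SynchronousQueue. A producer
// thread walks the tree recursively and hands over one node at a time; the
// consumer pulls nodes on demand, so the traversal is fully lazy.
public class ChannelDemo {
    private static final String DONE = "\0DONE"; // end-of-stream sentinel

    static List<String> consumeAll(Map<String, List<String>> edges, String root)
            throws InterruptedException {
        BlockingQueue<String> channel = new SynchronousQueue<>();
        Thread producer = new Thread(() -> {
            try {
                walk(edges, root, channel);
                channel.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        List<String> result = new ArrayList<>();
        for (String node = channel.take(); !node.equals(DONE); node = channel.take())
            result.add(node); // producer stays blocked until we take the next node
        producer.join();
        return result;
    }

    private static void walk(Map<String, List<String>> edges, String node,
                             BlockingQueue<String> out) throws InterruptedException {
        out.put(node); // blocks until the consumer is ready -- this is the laziness
        for (String child : edges.getOrDefault(node, List.of()))
            walk(edges, child, out);
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<String>> edges = Map.of(
            "1", List.of("2", "3"),
            "2", List.of("4"));
        System.out.println(consumeAll(edges, "1")); // [1, 2, 4, 3]
    }
}
```

Note that this buys full laziness at the cost of a dedicated producer thread, which must be cleaned up if the consumer abandons the traversal early.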

  • A custom `Spliterator` seems to be what you want... – fge Sep 23 '15 at 20:47
  • Very similar question - http://stackoverflow.com/questions/32656888/recursive-use-of-stream-flatmap – ZhongYu Sep 23 '15 at 21:14
  • I would seriously consider using something like Guava's `TreeTraverser` and then wrapping that as a `Stream`. – Louis Wasserman Sep 23 '15 at 23:38
  • [this answer](http://stackoverflow.com/a/31254841/1362755) contains some suggestions to increase laziness, maybe it's applicable to your situation. – the8472 Sep 24 '15 at 15:49

5 Answers

17

To me, your solution is already as elegant as possible, and its limited laziness is not your fault. The simplest solution is to wait until it gets fixed by the JRE developers; this has been done in Java 10.

However, if this limited laziness of today's implementation really is a concern, it's perhaps time to solve it in a general way. Well, it is about implementing a Spliterator, but not one specific to your task. Instead, it's a re-implementation of the flatMap operation, serving all cases where the limited laziness of the original implementation matters:

public class FlatMappingSpliterator<E,S> extends Spliterators.AbstractSpliterator<E>
implements Consumer<S> {

    static final boolean USE_ORIGINAL_IMPL
        = Boolean.getBoolean("stream.flatmap.usestandard");

    public static <T,R> Stream<R> flatMap(
        Stream<T> in, Function<? super T,? extends Stream<? extends R>> mapper) {

        if(USE_ORIGINAL_IMPL)
            return in.flatMap(mapper);

        Objects.requireNonNull(in);
        Objects.requireNonNull(mapper);
        return StreamSupport.stream(
            new FlatMappingSpliterator<>(sp(in), mapper), in.isParallel()
        ).onClose(in::close);
    }

    final Spliterator<S> src;
    final Function<? super S, ? extends Stream<? extends E>> f;
    Stream<? extends E> currStream;
    Spliterator<E> curr;

    private FlatMappingSpliterator(
        Spliterator<S> src, Function<? super S, ? extends Stream<? extends E>> f) {
        // actually, the mapping function can change the size to anything,
        // but it seems, with the current stream implementation, we are
        // better off with an estimate being wrong by magnitudes than with
        // reporting unknown size
        super(src.estimateSize()+100, src.characteristics()&ORDERED);
        this.src = src;
        this.f = f;
    }

    private void closeCurr() {
        try { currStream.close(); } finally { currStream=null; curr=null; }
    }

    public void accept(S s) {
        curr=sp(currStream=f.apply(s));
    }

    @Override
    public boolean tryAdvance(Consumer<? super E> action) {
        do {
            if(curr!=null) {
                if(curr.tryAdvance(action))
                    return true;
                closeCurr();
            }
        } while(src.tryAdvance(this));
        return false;
    }

    @Override
    public void forEachRemaining(Consumer<? super E> action) {
        if(curr!=null) {
            curr.forEachRemaining(action);
            closeCurr();
        }
        src.forEachRemaining(s->{
            try(Stream<? extends E> str=f.apply(s)) {
                if(str!=null) str.spliterator().forEachRemaining(action);
            }
        });
    }

    @SuppressWarnings("unchecked")
    private static <X> Spliterator<X> sp(Stream<? extends X> str) {
        return str!=null? ((Stream<X>)str).spliterator(): null;
    }

    @Override
    public Spliterator<E> trySplit() {
        Spliterator<S> split = src.trySplit();
        if(split==null) {
            Spliterator<E> prefix = curr;
            while(prefix==null && src.tryAdvance(s->curr=sp(f.apply(s))))
                prefix=curr;
            curr=null;
            return prefix;
        }
        FlatMappingSpliterator<E,S> prefix=new FlatMappingSpliterator<>(split, f);
        if(curr!=null) {
            prefix.curr=curr;
            curr=null;
        }
        return prefix;
    }
}

All you need to use it is to add a static import of the flatMap method to your code and change expressions of the form stream.flatMap(function) to flatMap(stream, function).

I.e. in your code

public Stream<String> descendants(String node) {
    return Stream.concat(
        Stream.of(node),
        flatMap(children(node), this::descendants)
    );
}

then you have full lazy behavior. I tested it even with infinite streams…

Note that I added a toggle to allow switching back to the original implementation, e.g. by specifying -Dstream.flatmap.usestandard=true on the command line.

Holger

  • Interesting solution. Try `flatMap(IntStream.range(0, 1000000).boxed(), Stream::of).parallel().collect(Collectors.summingInt(Integer::intValue))` and compare with the JDK version. Yours is enormously slow (like 30-50 times slower). – Tagir Valeev Sep 25 '15 at 07:41
  • Note that when raising the per-item overhead as in `flatMap(IntStream.range(0, 1000).boxed().parallel(), Stream::of) .map(i->{ LockSupport.parkNanos(1); return i; }) .collect(Collectors.summingInt(Integer::intValue));` the execution *does* benefit from parallel execution, and both implementations are on par. – Holger Sep 25 '15 at 08:49
  • Why rely on feelings when you can [measure](https://gist.github.com/amaembo/d86d51c332a724da488e)? The JDK flatMap does parallelize, and even with an additional peek operation it's much faster than yours. The stream pipeline is not smart enough to measure the Q of your task. It cannot estimate whether `Integer::intValue` is fast or not; it's just a black box. Even without peek it starts all the FJP threads. You can check via `Thread.enumerate` after the task finishes. `IntStream.range(0,2)` will start only 1 additional thread, for example. – Tagir Valeev Sep 25 '15 at 09:00
  • Oh, I see. The real problem with my test was that I should parallelize *before* your flatMap, as adding an innocent `boxed()` already makes the source spliterator unsplittable. This works fine: `flatMap(IntStream.range(0, 1000000).boxed().parallel(), Stream::of).collect(Collectors.summingInt(Integer::intValue))`. – Tagir Valeev Sep 25 '15 at 09:18
  • The JDK flatMap does parallelize, that’s right, as my example with a higher workload shows. It also starts new threads in your example, but the assignment of workloads seems to differ. Btw., on my system the placement of `parallel()` made no difference—it could split the source range, and splitting the substreams (the part within `if(split == null)`) can’t make a difference in general, as in this example the substreams have length one. But for whatever reason, the `Stream` implementation splits the range again and again, instead of creating < *number of cores* > working sets. – Holger Sep 25 '15 at 09:44
  • Actually the reason is that you refuse to estimate the size of your task and just always pass Long.MAX_VALUE. Thus the Stream engine thinks that the task is still big enough and that it's reasonable to split it further. Doing more splits than you have cores is very reasonable, as you can finish your work faster in case of an exception/short-circuit, or get a more balanced load if the processing time differs significantly between the beginning and the end of your stream. In this case the Stream engine heavily relies on the reported estimated size. – Tagir Valeev Sep 25 '15 at 09:52
  • `Long.MAX_VALUE` is specified by the API as “unknown size”, not “very big size”. If the stream implementation interprets it that wrongly, it’s a bug. Since the function can return anything from an empty to an infinite stream, I don’t think it is valid to estimate any size. The reasoning is wrong either way: if sub tasks complete faster due to an exception or short-circuiting, they shouldn’t get new tasks, as the entire operation should end anyway. Nevertheless, for practical purposes, I added it. – Holger Sep 25 '15 at 10:54
  • @Holger: “The simplest solution is to wait until it gets fixed by the JRE developers.” – are you aware of any such plans? – user4235730 Sep 25 '15 at 18:21
  • I accepted this answer mainly because of its first paragraph: it seems to me that right now there is no elegant solution to my problem, so I will go with the original solution for now and accept the consequences. But thank you @Holger for that flatMap implementation, it was very educational. – user4235730 Sep 29 '15 at 18:25
  • Maybe you could have a look at [this](https://github.com/JavaChat/streems); it may help you. – fge Apr 29 '16 at 11:12
5

You're a little bit wrong in saying that the flatMap stream is not lazy. It is somewhat lazy, though its laziness is really limited. Let's use a custom Collection to track the requested elements inside your Tree class:

private final Set<String> requested = new LinkedHashSet<>();

private class MyList extends AbstractList<String> implements RandomAccess
{
    private final String[] data;

    public MyList(String... data) {
        this.data = data;
    }

    @Override
    public String get(int index) {
        requested.add(data[index]);
        return data[index];
    }

    @Override
    public int size() {
        return data.length;
    }
}

Now let's pre-initialize your class with some tree data:

public Tree() {
    // "1" is the root node, with three immediate descendants
    edges.put("1", new MyList("2", "3", "4"));
    edges.put("2", new MyList("5", "6", "7"));
    edges.put("3", new MyList("8", "9", "10"));
    edges.put("8", new MyList("11", "12"));
    edges.put("5", new MyList("13", "14", "15"));
    edges.put("7", new MyList("16", "17", "18"));
    edges.put("6", new MyList("19", "20"));
}

Finally let's check how many elements are actually requested from your list on different limit values:

public static void main(String[] args) {
    for(int i=1; i<=20; i++) {
        Tree tree = new Tree();
        tree.descendants("1").limit(i).toArray();
        System.out.println("Limit = " + i + "; requested = (" + tree.requested.size()
                + ") " + tree.requested);
    }
}

The output is the following:

Limit = 1; requested = (0) []
Limit = 2; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 3; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 4; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 5; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 6; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 7; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 8; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 9; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 10; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 11; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 12; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 13; requested = (12) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18]
Limit = 14; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 15; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 16; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 17; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 18; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 19; requested = (18) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10]
Limit = 20; requested = (19) [2, 5, 13, 14, 15, 6, 19, 20, 7, 16, 17, 18, 3, 8, 11, 12, 9, 10, 4]

Thus when only the root node is requested, no access to children is performed (as Stream.concat is smart). When the first immediate child is requested, the whole subtree for this child is processed even if that's unnecessary. Nevertheless, the second immediate child is not processed until the first one finishes. This could be problematic for short-circuiting scenarios, but in most cases your terminal operation is not short-circuiting, so it's still a fine approach.

As for your concerns about memory consumption: yes, it consumes memory proportional to the tree depth (and, more importantly, it consumes stack space). If your tree has thousands of nesting levels, you will have a problem with your solution, as you may hit a StackOverflowError with the default -Xss setting. For several hundred levels of depth it will work fine.
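
The depth concern can be made concrete with a self-contained sketch (the DepthDemo class, the tryCount helper, and the chain-shaped sample tree are made up for illustration): a degenerate "chain" tree deep enough will overflow the stack with the recursive descendants() solution at default settings, while shallow trees work fine.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the stack-depth limitation of the recursive flatMap solution.
public class DepthDemo {
    static final Map<String, List<String>> edges = new HashMap<>();

    // The recursive solution from the question, inlined for a self-contained demo.
    static Stream<String> descendants(String node) {
        return Stream.concat(
            Stream.of(node),
            edges.getOrDefault(node, List.of()).stream()
                 .flatMap(DepthDemo::descendants));
    }

    // Builds a chain n0 -> n1 -> ... -> n<depth> and tries to count all nodes.
    static String tryCount(int depth) {
        edges.clear();
        for (int i = 0; i < depth; i++)
            edges.put("n" + i, List.of("n" + (i + 1)));
        try {
            return Long.toString(descendants("n0").count());
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCount(100));       // shallow chains are fine: 101 nodes
        System.out.println(tryCount(1_000_000)); // deep chains overflow the stack
    }
}
```

The overflow happens because each tree level contributes several stream-machinery frames to the call stack during the terminal operation, so the usable depth is well below what a hand-rolled recursion with minimal frames would reach.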

We are using a similar approach in the business-logic layer of our application, and it works fine for us, though our trees are rarely deeper than 10 levels.

Tagir Valeev
  • So in other words, the `Stream` is lazy on the first level, but not beyond. I will edit the question to reflect that. The fundamental problem stays the same, though, right? – user4235730 Sep 24 '15 at 07:23
  • @user4235730, basically yes. Still, many things depend on how you define laziness. If you want to collect the results into a list (with some filtering, mapping, etc.), or use the forEach terminal operation, then the next element will not be requested from the tree until the previous one is processed. You can see the non-laziness only with short-circuiting, or if you use stream.iterator()/stream.spliterator() directly. – Tagir Valeev Sep 24 '15 at 07:31
1

Not a real answer, but just a thought:

If you peek into the value collection and on the next step "resolve" that last seen value to a new value collection, returning the next values in the same way recursively, then however this is implemented, you will always end up with some kind of "pointer" to the current element in the value collection at the current "level" of depth in the tree, along with some kind of stack holding all those "pointers".

This is because you need both the information about the higher levels in the tree (the stack) and a "pointer" to the current element at the current level; one entails the other.

Of course, you can implement this as a Spliterator that holds a Stack of Iterators (pointing to the corresponding value collection), but I suppose there will always be a "pointer" state at each depth level, even if it's hidden in Java's flatMap (or related) temporary objects.

As an alternative: how about using a "real" tree with nodes that hold a reference to their parent node? In addition, a map at the root of the tree holding a reference to every single node would simplify access to a sub-sub-sub-child. I guess the Spliterator implementation would then be really easy, because it only needs a reference to the current node for traversing and a stop criterion (the initial node value) to keep from walking too "high" up the tree.
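
A sketch of that parent-pointer idea (the Node class, its next() method, and the sample tree are illustrative, not part of any answer's actual code): with a reference to the parent on every node, pre-order traversal needs no stack at all, only the current node and the subtree root as the stop criterion.

```java
import java.util.*;

// Sketch: stackless pre-order traversal via parent pointers.
public class ParentPointerDemo {
    static class Node {
        final String id;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String id, Node parent) {
            this.id = id;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        /** The next node in pre-order within the subtree rooted at root, or null. */
        Node next(Node root) {
            if (!children.isEmpty())
                return children.get(0);                  // descend to the first child
            for (Node n = this; n != root; n = n.parent) {
                List<Node> siblings = n.parent.children; // climb until a right sibling exists
                int i = siblings.indexOf(n);
                if (i + 1 < siblings.size())
                    return siblings.get(i + 1);
            }
            return null;                                 // back at the subtree root: done
        }
    }

    static List<String> preOrder(Node root) {
        List<String> out = new ArrayList<>();
        for (Node n = root; n != null; n = n.next(root))
            out.add(n.id);
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node("1", null);
        Node a = new Node("2", root);
        new Node("3", root);
        new Node("4", a);
        System.out.println(preOrder(root)); // [1, 2, 4, 3]
    }
}
```

The linear indexOf scan stands in for the "which child is next" index mentioned in the comments; a real implementation would store that index in the node instead.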

jCoder
  • I am aware that the iteration state has to live somewhere, but I wanted it to be implicit. If I hand-roll a recursion, there is also a stack, but from the perspective of iteration, it is implicit in the runtime environment. If I use a higher-order iteration primitive like `flatMap`, the iteration state is hidden in that primitive, and that is also fine by me. I just wanted to avoid defining my own object holding the iteration state. Regarding your alternative: I don't get what that would buy us. Could you please elaborate on how the `Spliterator` implementation would then be easier? – user4235730 Sep 24 '15 at 07:19
  • And by the way, if this is “not a real answer”, maybe it should be one or more comments instead? – user4235730 Sep 24 '15 at 07:20
  • Comment was too long for comment field, so I went for an answer. When focusing on `Spliterator.tryAdvance` then only the index of the sub-node and the current node has to be remembered there; the depth inside the tree is automatically maintained by the fact that each node knows its parent. No need for a stack. But that would be a generally different data structure than yours. – jCoder Sep 24 '15 at 18:15
  • Just to make sure I understood you correctly: you are talking about the *left-child, right-sibling* tree representation? In that case: yes, one would only need the current node. But what do you mean by the “sub-node index”? – user4235730 Sep 25 '15 at 18:16
  • Depending on the implementation and assuming that one parent can have more than one or two children then an index for which child is next might be needed. But I guess this approach goes much too far in the wrong direction according to your scenario, so better stick to the one of the other ideas ;) – jCoder Sep 25 '15 at 20:42
0

I suggest something that is in fact similar to what you didn't want, but easier and more elegant to implement than maintaining a stack in the straightforward way:

public class TreeIterator {
    private Tree tree;
    private List<String> topLevelNodes;

    public TreeIterator(Tree t, String node) {
        topLevelNodes = new ArrayList<>();
        topLevelNodes.add(node);
        tree = t;
    }

    public String next() {
        if (topLevelNodes.size() > 0) {
            int last = topLevelNodes.size() - 1;
            String result = topLevelNodes.get(last);
            topLevelNodes.remove(last);
            // assumes Tree exposes a node's children, e.g. via a getter for edges
            topLevelNodes.addAll(tree.get(result));
            return result;
        }
        return null;
    }
}

(The call tree.get(result) assumes the Tree exposes a node's children somehow; this is just meant to share the idea.)
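
For completeness, a pull-style traversal like this can be wrapped into a lazy Stream via Spliterators.spliteratorUnknownSize, so callers still get the Stream API. A self-contained sketch (the LazyWrapDemo class, its descendants method, and the sample data are made up for illustration):

```java
import java.util.*;
import java.util.stream.*;

// Sketch: an explicit-stack traversal exposed through the Iterator contract
// and wrapped as a Stream. Each element is computed only when pulled.
public class LazyWrapDemo {
    static Stream<String> descendants(Map<String, List<String>> edges, String root) {
        Deque<String> pending = new ArrayDeque<>();    // explicit traversal state
        pending.push(root);
        Iterator<String> it = new Iterator<>() {
            public boolean hasNext() { return !pending.isEmpty(); }
            public String next() {
                String node = pending.pop();
                List<String> children = edges.getOrDefault(node, List.of());
                for (int i = children.size() - 1; i >= 0; i--)
                    pending.push(children.get(i));     // reversed, for left-to-right order
                return node;
            }
        };
        return StreamSupport.stream(
            Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "1", List.of("2", "3"),
            "2", List.of("4"));
        System.out.println(descendants(edges, "1").collect(Collectors.joining(","))); // 1,2,4,3
    }
}
```

As the comments below note, this materializes every sibling list on the pending stack, so memory grows with depth times node degree rather than with depth alone.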

sbeliakov
  • Although you call it a `List`, `topLevelNodes` here really is a stack: you only append and remove at the end. And compared to a stack of iterators, this materializes the objects under iteration at every level, thus it needs more memory than necessary. (The order of complexity is no longer the height, but the height times the node degree instead.) Sorry to say, but I don't find this to be “more elegant”. – user4235730 Sep 24 '15 at 07:09
  • Yes, I agree with your arguments – sbeliakov Sep 24 '15 at 07:41
  • Why do you use `get(last)`, followed by `remove(last)`? That's an unnecessary step. Just use `String result = topLevelNodes.remove(last);`. Besides that, using an `ArrayDeque` to be able to efficiently remove from the head would allow maintaining the correct ordering of the nodes… – Holger Sep 24 '15 at 09:14
  • Thanks for the remark about `remove`. However, removing from the head and inserting at the tail would increase memory consumption, so a `list` (or just a `stack`) fits better. – sbeliakov Sep 24 '15 at 09:23
0

Let's answer the question first by providing some technical discussion:

  1. A TreeNode may also hold a reference to a user object, the use of which is left to the user. Asking a TreeNode for its string representation with toString() returns the string representation of its user object.
  2. A tree node may have at most one parent and 0 or more children. TreeNode provides operations for examining and modifying a node's parent and children and also operations for examining the tree that the node is a part of. A node's tree is the set of all nodes that can be reached by starting at the node and following all the possible links to parents and children. A node with no parent is the root of its tree; a node with no children is a leaf. A tree may consist of many subtrees, each node acting as the root for its own subtree.
  3. The existing DefaultMutableTreeNode of Java 8 is modified.
  4. This class provides enumerations for efficiently traversing a tree or subtree in various orders or for following the path between two nodes.
  5. This is not a thread-safe class. If you intend to use a TreeNode (or a tree of TreeNodes) in more than one thread, you need to do your own synchronizing. A good convention to adopt is synchronizing on the root node of a tree.
  6. Serialized objects of this class will not be compatible with future Swing releases. The current serialization support is appropriate for short term storage or RMI between applications running the same version of Swing. As of 1.4, support for long term storage of all JavaBeans™ has been added to the java.beans package.

Check this modified version of TreeNode contributed in Git - TreeNode

Vaibhav Atray