
Is there a better way to get a Stream<T> object from a JavaRDD<T>?

This is my current (and obvious) solution:

JavaRDD<T> rdd = ...;
Stream<T> stream = rdd.collect().stream();

I'm wondering whether it is at all possible to avoid creating an intermediate list that has to hold all the elements in memory at once.

Valentin Ruano
    Since there is a [`toLocalIterator`](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/api/java/JavaRDDLike.html#toLocalIterator()) method, you could use [this answer](http://stackoverflow.com/a/24511534/5743988). – 4castle Mar 24 '17 at 23:07
  • @4castle sounds promising; does that avoid the all-in-memory-at-once issue, or would it be equivalent to rdd.collect().iterator()? – Valentin Ruano Mar 24 '17 at 23:19
  • I'm not familiar with Spark at all, but the documentation says *"The iterator will consume as much memory as the largest partition in this RDD."* Hopefully that answers your question? – 4castle Mar 24 '17 at 23:34
  • @4castle, Yes it does, thanks. – Valentin Ruano Mar 28 '17 at 13:30
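For reference, the approach suggested in the comments can be sketched as below. The Iterator-to-Stream conversion uses only the JDK (`Spliterators` + `StreamSupport`); a plain in-memory iterator stands in here for `rdd.toLocalIterator()`, which you would pass instead in real Spark code so that only one partition's worth of elements is resident at a time:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class RddToStream {
    // Wrap any Iterator<T> in a lazy, sequential Stream<T>.
    // With Spark you would call streamOf(rdd.toLocalIterator()),
    // so elements are pulled partition by partition rather than
    // collected into a single list first.
    static <T> Stream<T> streamOf(Iterator<T> iterator) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
                false); // sequential; the iterator is consumed on the driver anyway
    }

    public static void main(String[] args) {
        // Stand-in for rdd.toLocalIterator()
        Iterator<Integer> localIterator = Arrays.asList(1, 2, 3, 4).iterator();
        long evens = streamOf(localIterator).filter(x -> x % 2 == 0).count();
        System.out.println(evens); // prints 2
    }
}
```

Note the stream is single-use, like the iterator backing it: once a terminal operation runs, you would need a fresh `toLocalIterator()` to stream the RDD again.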

0 Answers