Why does Spark JavaRDD flatmap function return an iterator

Question

I am trying to go through the java word count example. As I understand spark RDDs are a special type of collections, and flat map basically converts a nested collection such Stream> => Stream then Why does the spark Java API in the line below need to return an iterator for each line? And how is it used in the RDD?

Shouldn't the function just end at Arrays.asList(line.toString().split(" ")) ?

JavaRDD words =
                lines.flatMap(line -> Arrays.asList(line.toString().split(" ")).iterator());

sujit · Accepted Answer · 2018-03-22T08:57:39.723

1

In Java API, flatMap function takes an Object/Function of Functional Interface FlatMapFunction , whose contract(call function) is to return an Iterator:

java.util.Iterator< R> call(T t) throws Exception

Compare this to scala flatMap and you see something similar being the syntax there. But the authors have been able to implement it employing scala's implicit feature, so as to be much user friendly.

The reason for an Iterator< DiffObject> will make sense once you understand that map is supposed to return exactly same number of items that was input to it which may be of different type. But, flatMap can return any number (0 inclusive) of elements than the input which also may be of different type. Internally the implementation would be using the lambda you supplied, to get the final list by combining the output of those iterators.

edited Mar 22 '18 at 08:57

answered Mar 22 '18 at 08:47

sujit

2,258
1
15
24

Thanks @sujit so let me understand this better. Supposed I have a Stream< Stream > {{a,b},{c,d}} Then the lambda function will return an iterator i1 for iterating over {a,b} i2 for iterating over {c,d}. Then the flatmap function just uses these iterators internally to create a single Stream. – mdmac Mar 22 '18 at 17:58
1

First of all, I don't think Spark [JavaRDD](https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html) uses Java 8 Streams. Notice that it extends Object. It just uses a similar structure/semantics as the API was introduced post Java8 which has streams. More @ this [link](https://msystechnologies.com/apache-spark-rdd-java-8-streams/). For your usage scenario, you are just providing a lambda that takes a String and returns an Iterator. As per flatMap contract you could return Iterator. Consider marking as answer if this answers. – sujit Mar 23 '18 at 06:21
Ok that helps. Thanks a lot. – mdmac Mar 23 '18 at 16:18

Why does Spark JavaRDD flatmap function return an iterator

1 Answers1

Linked