This is pretty strange to me. I am familiar with the differences between map and foreach in Scala and the use cases for both, but perhaps I am missing something else. I first ran into this while playing around with Spark, so possibly it only manifests itself when I am using an RDD.
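For context, my expectation comes from strict collections, where map and foreach both run their function immediately. A quick REPL check of my own, nothing Spark-specific:

// On a strict collection like List, map applies its function eagerly,
// so both of these print 1, 2, 3:
List(1, 2, 3).map(println)     // returns List((), (), ())
List(1, 2, 3).foreach(println) // returns Unit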
Here is the code in which the call to map is seemingly ignored. I am using Scala 2.11.1, and here are my dependencies for running it:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0"
)
The following can be pasted into a Scala console:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.RangePartitioner
val conf: SparkConf = new SparkConf().setMaster("local").setAppName("Test")
val sc: SparkContext = new SparkContext(conf)
val rdd: RDD[Tuple2[String, String]] = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))
val rp = new RangePartitioner(3, rdd)
val parts = rdd.partitionBy(rp).cache()
parts.mapPartitionsWithIndex((x, y) => { y.map(println); y }).collect()
When running this, nothing is printed to stdout. However, if you change the last line to

parts.mapPartitionsWithIndex((x, y) => { y.map(println) }).collect()

or even to

parts.mapPartitionsWithIndex((x, y) => { y.foreach(println); y }).collect()

the tuples are printed.
I believe this is different from the existing question about stdout output not appearing, since I am running in local mode, and this looks like an issue with how the RDD is evaluated rather than with where stdout goes.
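Interestingly, I can reproduce the same pattern with a plain Scala Iterator in the REPL (the y handed to the function in mapPartitionsWithIndex is an Iterator), so perhaps it is not RDD-specific after all. This is just my own minimal sketch, not part of the Spark code above:

// map on an Iterator is lazy: the mapped iterator is discarded
// without ever being consumed, so nothing prints
Iterator("a", "b").map(println)

// consuming the mapped iterator forces the side effect: prints a, b
Iterator("a", "b").map(println).toList

// foreach is eager: prints a, b immediately
Iterator("a", "b").foreach(println)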