I would like to know why Spark does not allow broadcasting an RDD, yet allows broadcasting a DataFrame:
val df = Seq(("t","t"),("t","f"),("f","t"),("f","f")).toDF("x1", "x2")
val rdd = df.rdd
val b_df = spark.sparkContext.broadcast(df) //you can do this!
val b_rdd = spark.sparkContext.broadcast(rdd) //IllegalArgumentException!
What's the use of a broadcast DataFrame? I know that we cannot operate on an RDD within another RDD's transformation, but attempting to operate on a DataFrame within an RDD transformation is also forbidden:
rdd.map(r => b_df.value.count).collect //SparkException
I am trying to find a way to use Spark for a situation where I need to transform one parallelized collection using transformations/actions that involve another parallelized collection.
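For context, the workaround I have seen suggested (this is a sketch of my understanding, not a confirmed best practice) is to first collect the small dataset to the driver and broadcast the resulting plain local collection, which tasks can then read freely:

```scala
// Sketch, assuming a local SparkSession; collect the small DataFrame to the
// driver first, then broadcast the ordinary local Array instead of the
// distributed DataFrame itself.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bcast").getOrCreate()
import spark.implicits._

val df  = Seq(("t","t"),("t","f"),("f","t"),("f","f")).toDF("x1", "x2")
val rdd = df.rdd

// Pull the small table onto the driver as a plain Array[(String, String)].
val local  = df.as[(String, String)].collect()
val bLocal = spark.sparkContext.broadcast(local)

// Each task reads the broadcast local value; no distributed object is
// referenced inside the transformation, so this does not throw.
val counts = rdd.map(_ => bLocal.value.length).collect()
```

This avoids the SparkException because the closure only captures a local `Array`, not a DataFrame or RDD.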