
I would like to know why, in Spark, we are not allowed to broadcast an RDD but we can broadcast a DataFrame.

val df = Seq(("t","t"),("t","f"),("f","t"),("f","f")).toDF("x1", "x2")
val rdd = df.rdd
val b_df = spark.sparkContext.broadcast(df) //you can do this!
val b_rdd = spark.sparkContext.broadcast(rdd) //IllegalArgumentException!

What's the use of a broadcasted DataFrame? I know that we cannot operate on an RDD within another RDD transformation, but attempting to operate on a DataFrame within an RDD transformation is also forbidden.

rdd.map(r => b_df.value.count).collect //SparkException

I am trying to find ways to exploit Spark's capabilities in a situation where I have to operate over one parallelized collection using transformations that invoke transformations/actions of other parallelized collections.

Jane Wayne

1 Answer


That's because a DataFrame is not necessarily distributed. If you check carefully, you'll see that Dataset provides an isLocal method that:

Returns true if the collect and take methods can be run locally (without any Spark executors).
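A minimal sketch of what that means (assuming a local SparkSession; the variable names are illustrative). A Dataset built directly from a local collection is backed by a local relation, so it may report `isLocal = true`, while a plan that involves a shuffle will not:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("isLocalDemo")
  .getOrCreate()
import spark.implicits._

// Created from a local Seq: the logical plan is a local relation,
// so collect/take can run without any executors.
val localDf = Seq(("t", "t"), ("t", "f")).toDF("x1", "x2")
localDf.isLocal // expected to be true for a purely local plan

// After a repartition the plan requires executors.
val distributedDf = localDf.repartition(2)
distributedDf.isLocal // expected to be false
```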

Local DataFrames can be even used, although it is not advertised, in a task - Why does this Spark code make NullPointerException?

Broadcasting a Dataset uses a similar mechanism - it collects the data to create a local object and then broadcasts it. So it is little more than syntactic sugar for collect followed by broadcast (under the covers it uses a more sophisticated approach than collect, to avoid conversion to the external format), which you can do with an RDD yourself.
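A sketch of that RDD equivalent (variable names are illustrative; this assumes the RDD is small enough to fit on the driver, which is the only case where broadcasting makes sense anyway):

```scala
// Broadcasting an RDD directly throws IllegalArgumentException, but the
// DataFrame behavior can be reproduced by collecting first:
val rdd = spark.sparkContext.parallelize(Seq(("t", "t"), ("t", "f")))

// Pull the (small) data to the driver as a plain local Array...
val localData = rdd.collect()

// ...then broadcast the local object, which is always allowed.
val b_data = spark.sparkContext.broadcast(localData)

// The broadcast value can now be used inside other transformations:
val other = spark.sparkContext.parallelize(1 to 4)
val sizes = other.map(_ => b_data.value.length).collect()
```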

Alper t. Turker
  • Like an RDD, a DataFrame is an immutable distributed collection of data. I am not sure you have addressed the question. – thebluephantom Aug 09 '18 at 10:36
  • @thebluephantom Dataset (DataFrame) is a broader abstraction. It doesn't necessarily represent distributed object (local Dataset I mentioned above, which might be processed without triggering Spark job) or collection (Streaming Dataset). So instead _DataFrame is ..._ it should be _DataFrame can be ..._ – Alper t. Turker Aug 09 '18 at 10:40
  • https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html - I got it from here, from the Databricks blog. I think I get your point, but for me it is distributed in the most general sense. Thx for your speedy reply – thebluephantom Aug 09 '18 at 10:44
  • When you say it can be run locally, do you mean locally on the driver node or worker nodes? Or both? If you look at my code above, a broadcasted `DataFrame` cannot have an action invoked within an outer `RDD` transformation. – Jane Wayne Aug 10 '18 at 18:10
  • @JaneWayne I would use the stuff as intended in a general sense. Broadcast is for small data; large data has other approaches. I think he means both. People always talk about drivers, but they also unwittingly mean workers. – thebluephantom Aug 12 '18 at 07:56