From the Spark source code:
/**
 * Represents the content of the Dataset as an `RDD` of `T`.
 *
 * @group basic
 * @since 1.6.0
 */
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  rddQueryExecution.toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}
That mapPartitions can take as long as computing the RDD in the first place, which makes even simple operations such as df.rdd.getNumPartitions very expensive.
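For reference, here is a rough timing sketch of what I mean. It assumes a SparkSession is already running and a DataFrame named df exists over some non-trivial query; the names and the queryExecution.toRdd comparison are mine, just for illustration, not taken from anywhere authoritative.

// Rough timing sketch: `df` is an assumed, pre-existing DataFrame.
def timeIt[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label%-45s ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Goes through the lazy val quoted above: builds the RDD plus the extra
// deserializing mapPartitions, then asks for the partition count.
timeIt("df.rdd.getNumPartitions") {
  df.rdd.getNumPartitions
}

// For comparison: the underlying RDD[InternalRow] without the re-mapping.
timeIt("df.queryExecution.toRdd.getNumPartitions") {
  df.queryExecution.toRdd.getNumPartitions
}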
Given that a DataFrame is a Dataset[Row], and a Dataset is composed of RDDs, why is a re-mapping required at all? Any insights appreciated.