A DataFrame is an optimized, distributed, tabular collection. Because it keeps a tabular format (similar to a SQL table), it maintains schema metadata that lets Spark perform optimizations under the hood.
These optimizations are carried out by Spark SQL components such as Catalyst (query optimization) and Tungsten (memory management and code generation).
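For instance, you can see Catalyst at work through explain(true), which prints the logical and physical plans. A minimal sketch (the people data and column names here are made up just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

// A tiny DataFrame with a known schema (name: String, age: Int)
val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")

// explain(true) prints the parsed, analyzed, optimized (Catalyst) and
// physical (whole-stage codegen / Tungsten) plans for this query
people.filter($"age" > 21).select($"name").explain(true)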
An RDD does not maintain any schema; you have to provide one yourself if you need it. So an RDD is not as highly optimized as a DataFrame (Catalyst is not involved at all).
Converting a DataFrame to an RDD forces Spark to loop over all the elements, converting each one from the highly optimized Catalyst representation into a plain Scala object.
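To make the difference concrete, here is a small sketch reusing the people DataFrame from the snippet above: the DataFrame filter goes through Catalyst, while the filter applied after .rdd works on plain Row objects and its lambda is a black box to the optimizer.

// Stays in the Catalyst/Tungsten world: the predicate is understood
// by the optimizer and can be pushed down / code-generated
val adultsDf = people.filter($"age" > 21)

// Forces a full conversion to RDD[Row]: fields must be extracted by
// hand and the lambda is opaque to Catalyst
val adultsRdd = people.rdd.filter(row => row.getAs[Int]("age") > 21)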
Have a look at the code behind .rdd:
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  rddQueryExecution.toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}

@transient private lazy val rddQueryExecution: QueryExecution = {
  val deserialized = CatalystSerde.deserialize[T](logicalPlan)
  sparkSession.sessionState.executePlan(deserialized)
}
So first it executes the plan and retrieves the output as an RDD[InternalRow],
which, as the name implies, is only for internal use and needs to be converted into the external type (RDD[Row] in the case of a DataFrame).
Then it maps over all the rows, deserializing each one. As you can see, it's not just dropping the schema.
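If you want to observe that extra step from user code, here is a rough sketch, again reusing the people DataFrame from the first snippet (queryExecution is a developer-facing API, so the exact output may differ between Spark versions):

// The internal representation: an RDD[InternalRow], no per-row conversion yet
val internal = people.queryExecution.toRdd

// The public .rdd: an RDD[Row]; its lineage contains the extra
// MapPartitionsRDD produced by the deserializing mapPartitions shown above
val external = people.rdd
println(external.toDebugString)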
Hope that answers your question.