I have a DataFrame (df) that I want to use in a different SparkSession, within the same SparkContext.
I am creating the new DataFrame with:
val newDf = spark.createDataFrame(df.rdd, df.schema)
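For context, a minimal sketch of the full setup (the Parquet path and variable names here are just placeholders):

```scala
val df = spark.read.parquet("/path/to/table")          // original DF, backed by Parquet files

val spark2 = spark.newSession()                        // different SparkSession, same SparkContext
val newDf = spark2.createDataFrame(df.rdd, df.schema)  // rebuilt from the underlying RDD[Row]
```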
What I want to understand is the performance difference for a simple action like df.count().
count() on the original DF returns almost immediately; on the newly created DF it takes much longer.
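I'm timing the counts roughly like this (SparkSession.time prints the elapsed time):

```scala
spark.time(df.count())     // returns almost immediately
spark.time(newDf.count())  // takes much longer
```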
The original DF's physical plan shows reads against the original Parquet files (a "Scan Parquet" operator).
The new DF's plan shows "Scan ExistingRDD" instead, which makes sense, but I don't understand why performance should differ so much when the existing RDD ultimately comes from the same Parquet files.
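This is how I'm comparing the two plans:

```scala
df.explain()     // physical plan ends in a "Scan parquet" operator
newDf.explain()  // physical plan shows "Scan ExistingRDD" instead
```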
What am I missing?