I'm trying to compare the performance of Spark queries on a dataset backed by Parquet files against the same dataset cached in memory. Surprisingly, queries on the Parquet dataset are faster than queries on the cached data. I see at least two reasons why this should not be the case:
- the cached data is in memory, while the Parquet file isn't (it's on my SSD)
- I'm expecting the cached data to be stored in a format optimized for Spark queries (a quick way to check what actually got cached is sketched after this list)
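To make that second expectation concrete, here is a minimal sketch (assuming Spark 2.x in spark-shell, and the same file and column names as below; not exactly what I ran) of pinning the data in memory explicitly instead of relying on the MEMORY_AND_DISK default of cache(), and checking what Spark reports:

import org.apache.spark.storage.StorageLevel

// Sketch only: ask for a pure in-memory level, materialize it,
// then check the level and the plan Spark actually uses.
val df = spark.read.parquet("myfile.parquet")
val inMem = df.persist(StorageLevel.MEMORY_ONLY)
inMem.count()               // forces the whole dataset into the cache
println(inMem.storageLevel) // e.g. StorageLevel(memory, deserialized, 1 replicas)
inMem.explain()             // plan should show an InMemoryTableScan over an InMemoryRelation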
I've run this small benchmark on a 300 MB Parquet file (9M rows), timing only the query, not the time needed to cache the data:
def benchmarkSum(ds: org.apache.spark.sql.DataFrame): Double = {
  // Run the aggregation 1000 times and return the total elapsed time in seconds.
  val begin = System.nanoTime()
  for (_ <- 1 to 1000) {
    ds.groupBy().sum("columnName").first()
  }
  (System.nanoTime() - begin) / 1e9
}
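As an aside, the variant sketched below (not what I actually ran) times each iteration separately, which would make JIT warm-up or GC outliers visible instead of folding them into one total; columnName is the same placeholder as above:

// Sketch only: per-iteration timings instead of one aggregated total.
def benchmarkSumPerRun(ds: org.apache.spark.sql.DataFrame, runs: Int = 1000): Seq[Double] = {
  (1 to runs).map { _ =>
    val begin = System.nanoTime()
    ds.groupBy().sum("columnName").first()
    (System.nanoTime() - begin) / 1e9
  }
}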
val pqt = spark.read.parquet("myfile.parquet")
benchmarkSum(pqt) // 54s
val cached = pqt.cache()
cached.groupBy().sum("columnName").first() // A first call to trigger caching before the benchmark.
benchmarkSum(cached) // 77s
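As a quick cross-check of these totals, single runs can also be timed directly in spark-shell; spark.time prints the elapsed time of the enclosed action (a sketch, run after both DataFrames are warm):

// Prints "Time taken: ... ms" for one run of the same aggregation on each DataFrame.
spark.time(pqt.groupBy().sum("columnName").first())
spark.time(cached.groupBy().sum("columnName").first())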
The queries on Parquet took 54s, while they took 77s on the cached dataset.
I am running this benchmark in spark-shell with 8 cores and 10 GB of memory.
So why is it slower to use cached data to sum my column? Am I doing something wrong?