The second time I run a query it's significantly faster. Why?
Code:
public void test3() {
    Dataset<Row> sqlDF = spark.read().json("src/main/resources/data/ipl.json");
    // Note: repartition() returns a new Dataset; the result is discarded here.
    sqlDF.repartition(2);
    Dataset<Row> result1 = sqlDF.where("run > 10000").select("team", "run");
    // Dataset<Row> cachedPartition = result1.cache();
    result1.collect();
    // result1.show();
    log.info("Physical Plan\n" + result1.queryExecution().executedPlan());
    Dataset<Row> result2 = sqlDF.where("run > 10000").select("team", "run");
    result2.collect();
    // result1.show();
    log.info("Physical Plan\n" + result2.queryExecution().executedPlan());
}
Physical plans:
Execution times in the Spark UI:
Why do these two identical queries take different amounts of time, and why is the difference so large? Is caching happening under the hood? If so, why isn't it shown in the physical plan?