Why does a Spark query run faster when it's executed a second time?

Question

The second time I run a query it's significantly faster. Why?

Code:

public void test3() {
    Dataset<Row> sqlDF = spark.read().json("src/main/resources/data/ipl.json");
    // Datasets are immutable: the repartitioned Dataset returned here is discarded.
    sqlDF.repartition(2);
    Dataset<Row> result1 = sqlDF.where("run > 10000").select("team", "run");
    // Dataset<Row> cachedPartition = result1.cache();
    result1.collect();
    // result1.show();
    log.info("Physical plan\n" + result1.queryExecution().executedPlan());

    Dataset<Row> result2 = sqlDF.where("run > 10000").select("team", "run");
    result2.collect();
    // result1.show();
    log.info("Physical plan\n" + result2.queryExecution().executedPlan());
}

Physical plans: [screenshot]

Execution time on Spark UI: [screenshot]

Why do these queries take different amounts of time, and why is the difference in execution time so large? Is caching happening under the hood? If so, why is it not mentioned in the physical plan?

Please don't post photos of your screen. Representative data and code should be provided as text in your post. Is your question why the query is faster the second time you run it? — Lars Skaug, Oct 07 '20 at 20:30

Lars Skaug · Answer 1 · 2020-10-07T20:50:00.787

3

You're pointing Spark to a file. The second time you access the same file, the file will be accessed faster.

The same thing happens if you run the following Python code twice (except that Spark runs on the JVM and goes through java.nio and java.io, of course).

with open("src/main/resources/data/ipl.json") as f:
    t = f.read()
print(t)

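The first time, the I/O operation has to be initialized and the file has to be read from disk. The second time, the I/O operation can reuse what the previous run left behind, such as file contents the operating system still holds in memory. If the file is small (as it seems to be in your case), the whole file will likely have been cached.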
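A rough way to observe this outside Spark is to time the same read twice. The sketch below is only an illustration: `timed_read` and `compare_reads` are hypothetical helper names, and the demo uses a throwaway file as a stand-in for ipl.json. Because that file has just been written, it is probably already in the operating system's cache, so both reads may be fast; the gap is most visible on a file that has not been touched recently.

```python
import os
import tempfile
import time


def timed_read(path):
    """Read the whole file and return (seconds_elapsed, text)."""
    start = time.perf_counter()
    with open(path) as f:
        text = f.read()
    return time.perf_counter() - start, text


def compare_reads(path, runs=2):
    """Read the same file `runs` times and return the elapsed time of each read."""
    return [timed_read(path)[0] for _ in range(runs)]


# Demo on a throwaway file (an assumed stand-in for the real data file).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"team": "A", "run": 12000}\n' * 1000)
    demo_path = tmp.name

try:
    for i, seconds in enumerate(compare_reads(demo_path), start=1):
        print(f"read {i}: {seconds * 1000:.3f} ms")
finally:
    os.remove(demo_path)
```

To try it on your own data, call `compare_reads("src/main/resources/data/ipl.json")`. The absolute numbers depend on the disk, the operating system, and what else is running, so treat them as indicative only.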

edited Oct 07 '20 at 20:50

answered Oct 07 '20 at 20:43


2

I'd also guess that JVM warm-up accounts for some of the time on the first run. – busfighter Oct 07 '20 at 20:47
