
The second time I run a query it's significantly faster. Why?

Code:

public void test3() {
    Dataset<Row> sqlDF = spark.read().json("src/main/resources/data/ipl.json");
    sqlDF.repartition(2); // note: the returned Dataset is not assigned, so this has no effect
    Dataset<Row> result1 = sqlDF.where("run>10000").select("team", "run");
    //Dataset<Row> cachedPartition = result1.cache();
    result1.collect();
    //result1.show();log.info("PhysicalPlan\n"+result1.queryExecution().executedPlan());

    Dataset<Row> result2 = sqlDF.where("run>10000").select("team", "run");
    result2.collect();
    //result1.show();
    log.info("PhysicalPlan\n" + result2.queryExecution().executedPlan());
}

Physical plans:

[Screenshot: physical plans of both queries]

Execution time on spark UI:

[Screenshot: execution times in the Spark UI]

Why do these queries take different amounts of time, and why is the difference in execution time so large? Is caching happening under the hood? If so, why is it not mentioned in the physical plan?

Lars Skaug
akash patel
    Please don't post photos of your screen. Representative data and code should be provided as text in your post. Is your question why the query is faster the second time you run it? – Lars Skaug Oct 07 '20 at 20:30

1 Answer


You're pointing Spark to a file. The second time you read the same file, the read is faster.

It's the same situation if you run the following Python code twice (except that Spark runs on the JVM and uses java.nio and java.io, of course).

with open("src/main/resources/data/ipl.json") as f:
    t = f.read()
print(t)

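The first time, the I/O operation has to be initialized and the file has to be read from disk. The second time, the I/O operation can reuse parts of the last run, because the operating system typically keeps recently read file data in memory. If the file is small (as it seems to be in your case), the whole file will have been cached. This caching happens outside Spark, which is why it does not show up in the physical plan.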
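To see the effect for yourself, you can time repeated reads of the same file. The following is only a minimal sketch: `time_reads` and `make_sample_file` are hypothetical helper names, and the throwaway file stands in for `ipl.json`. A file that was just written is probably already in the operating system's cache, so to observe a slow first read, point `time_reads` at an existing file that has not been read recently.

```python
import os
import tempfile
import time


def time_reads(path, runs=2):
    """Read the whole file `runs` times and return each read's duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read()
        durations.append(time.perf_counter() - start)
    return durations


def make_sample_file(size_bytes=1_000_000):
    """Create a throwaway file (a stand-in for ipl.json) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size_bytes)
    return path


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        # Print one timing per read; later reads are usually served from memory.
        for i, seconds in enumerate(time_reads(sample), start=1):
            print(f"read {i}: {seconds * 1000:.3f} ms")
    finally:
        os.remove(sample)
```

The actual numbers depend on the operating system, the disk, and what else is running, so treat them as indicative only.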

Lars Skaug