I am measuring how long it takes to complete a particular job on Spark, in this case how long it takes to save an output RDD. Saving the RDD involves compressing it.
What is weird is that the first execution of the code is always slower than a second execution of exactly the same code. How can this be?
The Spark program looks like the following:
JavaPairRDD<String, String> semisorted = wordsAndChar.sortByKey();

// First run
long startTime1 = System.currentTimeMillis();
semisorted.saveAsTextFile("testData.txt" + "_output1", org.apache.hadoop.io.compress.DefaultCodec.class);
long runTime1 = (System.currentTimeMillis() - startTime1) / 1000; // elapsed seconds

// Second run: same RDD, same codec, only the output path differs
long startTime2 = System.currentTimeMillis();
semisorted.saveAsTextFile("testData.txt" + "_output2", org.apache.hadoop.io.compress.DefaultCodec.class);
long runTime2 = (System.currentTimeMillis() - startTime2) / 1000; // elapsed seconds

sc.stop();
The job is submitted with:
spark-submit --master local[1] --class com.john.Test my.jar /user/john/testData.txt /user/john/testData_output
The output is:
runTime1 = 126 secs
runTime2 = 82 secs
How can there be such a large variation between two identical jobs?
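One thing I would like to rule out is lazy evaluation: sortByKey() only actually runs when the first action (the first save) triggers it, so the first timing may include the sort and shuffle while the second may not. Below is a minimal sketch of how the sort could be forced to complete before either timed save (toy data; the class name and inline dataset are made up for illustration):

import java.util.Arrays;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SaveTiming {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SaveTiming").setMaster("local[1]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Toy stand-in for the real wordsAndChar RDD.
        JavaPairRDD<String, String> wordsAndChar = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("b", "2"), new Tuple2<>("a", "1"), new Tuple2<>("c", "3")));

        JavaPairRDD<String, String> semisorted = wordsAndChar.sortByKey();

        // Materialize the sorted RDD once, before any timing, so that
        // neither timed save pays for the sort/shuffle.
        semisorted.cache();
        semisorted.count();

        long start1 = System.currentTimeMillis();
        semisorted.saveAsTextFile("testData.txt_output1", DefaultCodec.class);
        System.out.println("runTime1 = " + (System.currentTimeMillis() - start1) / 1000 + " secs");

        long start2 = System.currentTimeMillis();
        semisorted.saveAsTextFile("testData.txt_output2", DefaultCodec.class);
        System.out.println("runTime2 = " + (System.currentTimeMillis() - start2) / 1000 + " secs");

        sc.stop();
    }
}

cache() plus count() is just one way to force materialization; any action that evaluates the full lineage before the first timed save would do.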