
I am calculating how long it takes to complete a particular job in Spark, in this case how long it takes to save an output RDD. Saving the RDD involves compressing it.

What is weird is that the first execution of the code is always slower than a second execution of exactly the same piece of code. How can this be?

The Spark program looks like the following:

JavaPairRDD<String, String> semisorted = wordsAndChar.sortByKey();

//First run
long startTime1 = System.currentTimeMillis();
semisorted.saveAsTextFile("testData.txt" + "_output1", org.apache.hadoop.io.compress.DefaultCodec.class);
long runTime1 = System.currentTimeMillis() - startTime1;

//Second run
long startTime2 = System.currentTimeMillis();
semisorted.saveAsTextFile("testData.txt" + "_output2", org.apache.hadoop.io.compress.DefaultCodec.class);
long runTime2 = System.currentTimeMillis() - startTime2;

sc.stop();

spark-submit --master local[1] --class com.john.Test my.jar /user/john/testData.txt /user/john/testData_output

The output is:

runTime1 = 126 secs

runTime2 = 82 secs

How can there be such a large variation between two (exactly the same) jobs?

nikk
  • RDDs are lazy. The first run is probably cached in memory for the second and subsequent runs – OneCricketeer Dec 10 '16 at 18:57
  • Also, two runs on a single machine isn't much of a benchmark – OneCricketeer Dec 10 '16 at 18:58
  • @cricket_007, "two runs", because the first run took too long. I have even had 3 and more runs. In each case, first run was slowest. This is just "testing" --need a reliable number from 1 machine before going to compare with somebody's machine. – nikk Dec 10 '16 at 19:04
  • I don't know how you've created `wordsAndChar`. Did you cache or persist it? What happens if you unpersist and recreate the RDD between runs? – OneCricketeer Dec 10 '16 at 19:12

1 Answer


These two jobs are not the same. Any shuffle operation, including sortByKey, creates shuffle files, which serve as an implicit caching point.

  • When you execute the first job, it has to perform a full shuffle and all the preceding operations.

  • When you execute the second job, it can read the shuffle files and execute only the last stage.

You should see skipped stages in the Spark UI which correspond to this behavior.
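
If you want the two timed saves to be genuinely comparable, one option is to rebuild the lineage before each run so the second save cannot reuse the shuffle files written by the first. This is only a sketch, assuming a JavaSparkContext `sc`, an input path `inputPath`, and a stand-in transformation for `wordsAndChar` (all placeholder names, not from the question):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Rebuild the RDD from scratch before each save, so every run performs the
// full shuffle instead of reading the shuffle files left by the previous run.
for (int run = 1; run <= 2; run++) {
    JavaPairRDD<String, String> fresh = sc.textFile(inputPath)
            .mapToPair(line -> new Tuple2<>(line, line)) // stand-in for wordsAndChar
            .sortByKey();                                // new shuffle on every run

    long start = System.currentTimeMillis();
    fresh.saveAsTextFile(inputPath + "_output" + run,
            org.apache.hadoop.io.compress.DefaultCodec.class);
    System.out.println("run " + run + ": "
            + (System.currentTimeMillis() - start) / 1000 + " secs");
}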

There is also another source of variation that can contribute here, although its impact should be smaller: many context-related objects in Spark are initialized lazily, and that initialization happens during the first job.
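
One way to keep that initialization cost out of the measurement (a sketch, assuming a JavaSparkContext `sc`) is to run a cheap throwaway action before starting any timers:

// Warm-up action: forces lazy, context-level initialization before timing.
sc.parallelize(java.util.Arrays.asList(1, 2, 3)).count();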

In general, if you want to monitor performance, you can use:

  • Spark UI for manual inspection.
  • Spark REST API (/applications/[app-id]/stages/[stage-id] is particularly useful) or a SparkListener to get detailed statistics (see the listener sketch below).
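
For example, a listener can sum the executor run time reported for every finished task. This is only a sketch, assuming an existing JavaSparkContext `sc` and the Spark 2.x listener API:

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerTaskEnd;
import java.util.concurrent.atomic.AtomicLong;

// Accumulates the executor run time (ms) of every finished task.
AtomicLong totalTaskTimeMs = new AtomicLong();

sc.sc().addSparkListener(new SparkListener() {
    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        if (taskEnd.taskMetrics() != null) {
            totalTaskTimeMs.addAndGet(taskEnd.taskMetrics().executorRunTime());
        }
    }
});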

You should also:

  • Perform multiple runs and adjust the results to correct for the initialization process.
  • unpersist objects if required (a sketch follows this list).
  • Avoid possible confounders (for example, multiple jobs executed by the same application are not independent and can be affected by a number of factors such as cache eviction or GC).
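
For the unpersist point, a minimal sketch using the `wordsAndChar` RDD from the question (only relevant if it was cached or persisted somewhere upstream):

// Drop any cached blocks between runs; blocking=true waits until the
// blocks are actually removed before the next measurement starts.
wordsAndChar.unpersist(true);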
zero323
  • So if this is an inherent behavior in Spark between jobs, say I wanted to accurately calculate and compare the execution times of those 2 pieces of code. How would I do that? Imagine in the second, instead of using `DefaultCodec`, I used `SnappyCodec`. – nikk Dec 10 '16 at 22:26