
My code is written in Spark and Scala. Now I need to measure the elapsed time of particular functions of the code.

Should I use spark.time like this? And if so, how can I properly assign the returned value to df?

val df = spark.time(myObject.retrieveData(spark, indices))
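From what I can tell from the SparkSession scaladoc, time is declared as def time[T](f: => T): T (since Spark 2.1), so presumably it passes the block's result through and the assignment above should already type-check, e.g. (untested sketch):

import org.apache.spark.sql.DataFrame

// if spark.time passes the result through, df keeps its DataFrame type
// and "Time taken: ... ms" is printed as a side effect
val df: DataFrame = spark.time(myObject.retrieveData(spark, indices))

But I am not sure whether this is the recommended way.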

Or should I do it this way?

def time[R](block: => R): R = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + "ns")
    result
}

val df = time { myObject.retrieveData(spark, indices) }
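If I later need a custom message such as "Elapsed time for retrieveData", I could presumably add a label parameter to the same helper (hypothetical variant, the name timeLabelled is mine):

def timeLabelled[R](label: String)(block: => R): R = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println(s"Elapsed time for $label: " + (t1 - t0) + "ns")
    result
}

val df = timeLabelled("retrieveData") { myObject.retrieveData(spark, indices) }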

Update:

As recommended in the comments, I now call df.rdd.count inside myObject.retrieveData in order to materialise the DataFrame.
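Roughly, retrieveData now ends like this (simplified sketch; the real query logic is replaced by a placeholder and the signature is my guess):

import org.apache.spark.sql.{DataFrame, SparkSession}

def retrieveData(spark: SparkSession, indices: Seq[Int]): DataFrame = {
    val df = spark.range(indices.length).toDF("idx")    // placeholder for the real query
    df.rdd.count()    // action that fully materialises the DataFrame
    df
}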

  • does your function return a DataFrame? If yes, you need to call an action on it which fully materializes the DataFrame – Raphael Roth May 31 '18 at 07:36
  • @RaphaelRoth: Yes, my function `retrieveData` returns a DataFrame. I call `count()` inside this function. My question is whether I can assign the DataFrame to `df` if I measure elapsed time this way. How can I add a custom print, e.g. "Elapsed time for retrieveData"? – ScalaBoy May 31 '18 at 07:41
  • `count` is not enough, you should use `rdd.count` (see https://stackoverflow.com/questions/42714291/how-to-force-dataframe-evaluation-in-spark). But you should not include that in your production code, as you will materialize your dataframe twice in this case. And keep in mind that this time will only be an estimate, because the query plan may change (will be optimized) after you add additional transformations. – Raphael Roth May 31 '18 at 07:44
  • @RaphaelRoth: Ok, see my update. After adding `df.rdd.count`, which approach should I use to make the estimation as precise as possible? – ScalaBoy May 31 '18 at 07:48

0 Answers