
I need to measure the execution time of a query on Apache Spark (Bluemix). What I tried:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Is this a good way to do it? The time I get looks too small compared to how long it actually takes for the table to appear.

YAKOVM

7 Answers


To do it in a spark-shell (Scala), you can use spark.time().

See another answer of mine: https://stackoverflow.com/a/50289329/3397114

val df = sqlContext.sql(query)
spark.time(df.show())

The output would be:

+----+----+
|col1|col2|
+----+----+
|val1|val2|
+----+----+
Time taken: xxx ms

Related: On Measuring Apache Spark Workload Metrics for Performance Troubleshooting.

Tyrone321

I wrap System.nanoTime in a helper function, like this:

def time[A](f: => A): A = {
  val s = System.nanoTime            // start timestamp
  val ret = f                        // evaluate the by-name block
  println("time: " + (System.nanoTime - s) / 1e6 + " ms")
  ret                                // return the block's result
}

time {
  val df = sqlContext.sql(query)
  df.show()
}
shridharama

Update: No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and accurate way I know of is to use the Spark History Server.

On Bluemix, in your notebooks, go to the "Palette" on the right side. Choose the "Environment" panel and you will see a link to the Spark History Server, where you can investigate the performed Spark jobs, including computation times.

Sven Hafeneger
  • I know the OP accepted the answer, but strangely enough it doesn't literally answer his question, i.e., using time.clock() to measure the query execution time. I had the same question, which is why I ended up here, but in the end there is no answer. – Nadjib Mami Oct 19 '16 at 07:53
  • @nadjib-mami Oops, good point, missed the simple "No" and went directly to the solution :) Thanks! – Sven Hafeneger Nov 07 '18 at 08:16
  • It still didn't answer why using `time` is not the best way to measure. – Sairam Krish Sep 13 '21 at 05:47

Spark itself provides very granular information about each stage of your Spark job.

You can view your running job at http://IP-MasterNode:4040, or you can enable the History Server to analyze the jobs at a later time.

Refer to the Spark monitoring documentation (https://spark.apache.org/docs/latest/monitoring.html) for more info on the History Server.
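
If you run your own cluster, the History Server works by reading Spark event logs. A minimal sketch of enabling them from PySpark follows; the app name and log directory are assumptions, so point the directory at a location your cluster can actually write to (HDFS, S3, or a local path):

from pyspark.sql import SparkSession

# Sketch: enable event logging so finished jobs show up in the History Server.
# The directory below is an assumption; replace it with a path that fits your setup.
spark = (SparkSession.builder
         .appName("timed-queries")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "file:///tmp/spark-events")
         .getOrCreate())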

Sumit
  • The OP is asking about the Apache Spark service on Bluemix, so they are not running their own Spark cluster under their own control; e.g., it does not expose the UI on 4040. – Randy Horman Apr 29 '16 at 12:07

If you are using spark-shell (Scala), you can use spark.time():

val df = sqlContext.sql(query)
spark.time(df.show())

However, spark.time() is not available in PySpark. For Python, a simple solution is to use the time module:

import time
start_time = time.time()
df.show()
print(f"Execution time: {time.time() - start_time}")
Amir Charkhi

You can also try using sparkMeasure, which simplifies the collection of Spark performance metrics.
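
Since link-only answers can go stale, here is a minimal PySpark sketch of sparkMeasure's stage-metrics API; it assumes the Python package is installed (pip install sparkmeasure) and the session was started with the spark-measure package on the classpath (e.g. via --packages ch.cern.sparkmeasure:spark-measure_2.12, version as appropriate):

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)   # attach to the active SparkSession
stagemetrics.begin()                 # start collecting stage metrics
spark.sql(query).show()
stagemetrics.end()                   # stop collecting
stagemetrics.print_report()          # elapsed time plus stage-level metrics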

Guy
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Andrew Halil Apr 06 '22 at 12:48

For those looking for (or needing) a Python version, since a pyspark Google search leads to this post:

from time import time
from datetime import timedelta

class T:
    """Context manager that prints the wall-clock time of its block."""
    def __enter__(self):
        self.start = time()
        return self
    def __exit__(self, type, value, traceback):
        self.end = time()
        elapsed = self.end - self.start
        print(str(timedelta(seconds=elapsed)))

Usage:

with T():
    df = sqlContext.sql(query)  # Spark code goes here
    df.show()

Inspired by: https://blog.usejournal.com/how-to-create-your-own-timing-context-manager-in-python-a0e944b48cf8

This proved useful in the console and in notebooks (the Jupyter magics %%time and %timeit are limited to cell scope, which is inconvenient when you have objects shared across the notebook context).

Mehdi LAMRANI