
I need to calculate the runtime of some code in Scala. The code is:

val data = sc.textFile("/home/david/Desktop/Datos Entrada/household/household90Parseado.txt")

val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

val numClusters = 5
val numIterations = 10 
val clusters = KMeans.train(parsedData, numClusters, numIterations)

I need to know the runtime of this code, and the time has to be in seconds.

David Rebe Garcia
  • What do you mean by "run time"? The time from when you start the job to the end? The total CPU time used by all the workers? Something else? What will you use the result for? Given Spark has a fairly significant startup and tear-down time, if your data set is small, you'll mostly be timing that and not your actual computation. – The Archetypal Paul Jun 09 '16 at 16:55
  • @TheArchetypalPaul I need to calculate the time that the system takes to run Apache Spark KMeans on one dataset. I have a dataset of 2,000,000 records. First I run KMeans with 10% of the dataset, then with 20%, etc. When I run the algorithm I've seen that sometimes the runtime with 60% is less than with 20%. Is this possible? – David Rebe Garcia Jun 09 '16 at 17:45
  • Depends. How many seconds are we talking about? Your actual run time may be swamped by variable factors around startup. Are all the machines being used entirely dedicated to just your job? – The Archetypal Paul Jun 09 '16 at 17:47
  • @TheArchetypalPaul I have only one virtual machine with 4 GB of RAM. The times are: 10%--24, 20%--88, 30%--83, 40%--89, 50%--86, 60%--63, 70%--65, 80%--71, 90%--78, 100%--71. All times are in seconds, and they are calculated with `System.currentTimeMillis()`. – David Rebe Garcia Jun 09 '16 at 17:58
  • That seems odd, but I don't really know what's happening "under the hood". How are you setting up the SparkContext? Are you using all the cores of your machine? Possibly Spark decides to use more cores for bigger data sets? I'd review the logs to see if you can see what changes – The Archetypal Paul Jun 09 '16 at 18:10
  • @TheArchetypalPaul `object RunKMeans {def main(args: Array[String]) {val conf = new SparkConf() .setAppName("RunKMeans") .setMaster("local") val sc = new SparkContext(conf)val data = sc.textFile("/home/david/Desktop/Datos Entrada/household/household60Parseado.txt") val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() val numClusters = 20 val numIterations = 1000 val clusters = KMeans.train(parsedData, numClusters, numIterations)` This is my code. Can you give me your email so I can send you the code? I am working with Scala IDE. – David Rebe Garcia Jun 09 '16 at 18:20
  • "Can you give me your email to send you the code? " No, that's not appropriate. – The Archetypal Paul Jun 09 '16 at 19:44
  • @TheArchetypalPaul Sorry, I have not explained myself well. I meant that I cannot post all the code on the page, and I wanted you to see the full code. How can I see the logs? How can I configure the cores that are going to be used? Thank you very much. – David Rebe Garcia Jun 09 '16 at 20:19
  • You need to read the Spark documentation; SO isn't the place to answer everything you need to know. – The Archetypal Paul Jun 09 '16 at 20:21
  • @TheArchetypalPaul I read the Spark documentation. I set these properties on the Spark context, but I have the same problem: the runtime with 50% of the dataset is longer than with 60% of the dataset. The code is `.set("spark.driver.cores","4") .set("spark.driver.maxResultSize","1g") .set("spark.driver.memory","1g") .set("spark.executor.memory","1g")` – David Rebe Garcia Jun 10 '16 at 15:42
  • What happens when you set the cores to 1? I'm wondering if Spark at some point decides to use another partition, so using another core and getting more parallelism. But I'm not really an expert at this level of Spark – The Archetypal Paul Jun 10 '16 at 15:44
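To try Paul's single-core suggestion from the last comment, the master URL can be pinned to an explicit number of threads. This is only a configuration sketch; the app name is taken from David's snippet, and the rest of the job would follow as in the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[1]" forces a single worker thread; "local[4]" allows four.
// Comparing run times under local[1] vs local[4] shows whether extra
// parallelism on larger inputs explains the non-monotonic timings.
val conf = new SparkConf()
  .setAppName("RunKMeans")
  .setMaster("local[1]") // the question used "local", which also means one thread
val sc = new SparkContext(conf)
```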

6 Answers


Based on discussion here, you'll want to use System.nanoTime to measure elapsed time:

val t1 = System.nanoTime

/* your code */

val duration = (System.nanoTime - t1) / 1e9d
evan.oman
    Since it is in nanoseconds, we need to divide it by 1,000,000,000; in `1e9d`, the "d" stands for double. (My aim is just to give some clarification regarding `1e9d`.) – S12000 Nov 16 '16 at 14:17
    Yep, `1e9d` is `10^9` as a double. We want a double so that the result will be a double rather than a long (integer division vs double division) – evan.oman Nov 16 '16 at 14:21
    `scala> 1 / 3 res0: Int = 0` vs `scala> 1 / 3d res1: Double = 0.333...` – evan.oman Nov 16 '16 at 14:24
  • Thanks evan058. It adds useful info. – S12000 Nov 16 '16 at 14:29
    What about using timeunit like TimeUnit.NANOSECONDS.toMinutes(total) instead of that manual division? – Yeikel Jan 13 '19 at 19:00
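To address Yeikel's comment: `java.util.concurrent.TimeUnit` (part of the JDK) can replace the manual division, at the cost of truncating to whole units. A sketch, not part of the answer itself:

```scala
import java.util.concurrent.TimeUnit

val t1 = System.nanoTime
/* your code */
val elapsedNanos = System.nanoTime - t1
// toSeconds/toMillis truncate toward zero, so the fractional part is lost
val seconds = TimeUnit.NANOSECONDS.toSeconds(elapsedNanos)
val millis = TimeUnit.NANOSECONDS.toMillis(elapsedNanos)
println(s"Elapsed: $seconds s ($millis ms)")
```

This reads more clearly than `/ 1e9d`, but keep the double division if you need sub-second precision.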

The most basic approach would be to simply record the start time and end time, and do subtraction.

val startTimeMillis = System.currentTimeMillis()

/* your code goes here */

val endTimeMillis = System.currentTimeMillis()
val durationSeconds = (endTimeMillis - startTimeMillis) / 1000
Larsenal
    I think I read that you want to use `System.nanoTime` rather than `System.currentTimeMillis()`. – evan.oman Jun 09 '16 at 15:59
    If he just wants to measure wall time to the nearest second, I personally wouldn't sweat it. Based on the question itself, I'm guessing the intent isn't to do intense, _high-precision_ profiling. – Larsenal Jun 09 '16 at 18:30
    From the answer: 'The purpose of nanoTime is to measure elapsed time, and the purpose of currentTimeMillis is to measure wall-clock time. You can't use the one for the other purpose ... You may say, "this doesn't sound like it would ever really matter that much," to which I say, maybe not, but overall, isn't correct code just better than incorrect code? Besides, nanoTime is shorter to type anyway.' – evan.oman Jun 09 '16 at 18:33
    I appreciate the thought, but in the scenario described in the question, it really won't matter. If this question about `nanoTime` vs `currentTimeMillis` was asked by a junior developer sitting next to me, I'd tell him to not sweat it and worry about more important things. – Larsenal Jun 09 '16 at 18:41
    Sure, there are more important things, but why not use the more correct version if there is absolutely no cost to switch to it? – evan.oman Jun 09 '16 at 18:43
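One detail worth noting about the answer above: `/ 1000` is integer division on Longs, so fractions of a second are silently dropped, the same pitfall evan.oman's `1e9d` comment addresses for nanoseconds:

```scala
// Illustrative values only; imagine endTimeMillis - startTimeMillis == 91500
val elapsedMillis = 91500L
val wholeSeconds = elapsedMillis / 1000 // Long division truncates: 91
val exactSeconds = elapsedMillis / 1000d // Double division keeps the fraction: 91.5
println(s"$wholeSeconds s truncated, $exactSeconds s exact")
```

If whole seconds are all you need, as in the question, the truncation is harmless.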

Starting with Spark 2+, we can use `spark.time(<command>)` (only in Scala so far) to get the time taken to execute an action/transformation.

Example:

Finding the count of records in a DataFrame:

scala> spark.time(
                 sc.parallelize(Seq("foo","bar")).toDF().count() //create df and count
                 )
Time taken: 54 ms //total time for the execution
res76: Long = 2  //count of records
notNull
Case: Before Spark 2.1.0

For Spark < 2.1.0, you can explicitly use this function in your code to measure time in milliseconds:

/**
 * Executes some code block and prints to stdout the time taken to execute the block. This is
 * available in Scala only and is used primarily for interactive testing and debugging.
 */
def time[T](f: => T): T = {
  val start = System.nanoTime()
  val ret = f
  val end = System.nanoTime()
  println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
  ret
}

Usage:

  time {
    Seq("1", "2").toDS().count()
  }
//Time taken: 3104 ms
Case: After Spark 2.1.0

For Spark >= 2.1.0, there is a built-in function on SparkSession: you can use spark.time.

Usage:

  spark.time {
    Seq("1", "2").toDS().count()
  }
//Time taken: 3104 ms
Ram Ghadiyaram

You can use scalameter: https://scalameter.github.io/

Just put your block of code in the braces:

import org.scalameter._

val executionTime = measure {
  //code goes here
}

You can configure it to warm up the JVM so the measurements will be more reliable:

val executionTime = withWarmer(new Warmer.Default) measure {
  //code goes here
}
fr3ak

This is another way to time Scala code, tagging each measurement with a label. (The original version evaluated the by-name block twice, once for the result and once for the label; destructuring it once fixes that.)

def time[R](block: => (String, R)): R = {
    val t0 = System.currentTimeMillis()
    val (label, result) = block // evaluate the by-name block exactly once
    val t1 = System.currentTimeMillis()
    println(label + " took an elapsed time of " + (t1 - t0) + " millis")
    result
}

val result = time {
    ("name for metric", your function call or your code)
}
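Pulling the answers together, here is a small self-contained variant (a sketch in plain Scala, no Spark required) that uses System.nanoTime as recommended above, evaluates the block exactly once, and returns both the result and the elapsed seconds instead of printing:

```scala
// Runs a by-name block once and returns (result, elapsed seconds).
def timed[R](block: => R): (R, Double) = {
  val start = System.nanoTime()
  val result = block // evaluated exactly once
  val elapsedSeconds = (System.nanoTime() - start) / 1e9d // double division keeps the fraction
  (result, elapsedSeconds)
}

val (sum, seconds) = timed { (1 to 1000000).sum }
println(s"sum=$sum computed in $seconds s")
```

Returning the duration rather than printing it makes the helper easier to use when collecting timings for several dataset sizes, as in the question.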