
I am trying to convert a Dataset to an Iterator in a Java program using the toLocalIterator method. It takes 1000+ ms of elapsed time, which is far higher than the same conversion in Scala.

I have tried converting a Dataset of size 3 in both Java and Scala. The elapsed times were around 1000 ms and 6 ms respectively.

//In Java
Dataset<Row> dataset = sparkSession.read().parquet(parquetPath);
Dataset<Row> datasetNew = dataset.select("col1"); // outputs "3053462", "3256790", "3269055"
long st_od = System.currentTimeMillis();
Iterator<Row> iterator = datasetNew.toLocalIterator();
long et_od = System.currentTimeMillis();
logger.info("Elapsed time for iterator conversion: " + (et_od - st_od) + "ms");
//In Scala
val data = List("3053462", "3256790", "3269055")
val df = spark.sparkContext.parallelize(data)

val st = System.currentTimeMillis()
val iter = df.toLocalIterator
val et = System.currentTimeMillis()

println("Time: " + (et - st))

I expected around 6 ms in the Java code as well, but the toLocalIterator operation takes 1000+ ms. Does anyone know the reason?

Aadhil Rushdy
  • A) read, digest and apply https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java B) both things should just be a simple method call in bytecode (so: compare bytecode) C) use a profiler ... figure out where the system really spends its time. 1 second is **a lot** of time. – GhostCat May 29 '19 at 11:37
    I think you're comparing apples and oranges here since you're not starting with the same data. Also, do not use System.currentTimeMillis for measuring runtime, it is vulnerable to clock skew and not very accurate, use System.nanoTime. – cbley May 29 '19 at 11:37
  • @cbley I am using the same data. In Scala: val data = List("3053462", "3256790", "3269055"); val df = spark.sparkContext.parallelize(data). In Java, "col1" has the same String values. – Aadhil Rushdy May 29 '19 at 11:39
    @AadhilRushdy it is not obvious looking at your code, it still isn't. The Java code selects some column, which the Scala code does not. – cbley May 29 '19 at 11:45
  • @cbley The output of the select statement in Java is a single column of String type. Since the time cost is high, I tried the same thing in Scala, creating a dataframe similar to the output of the select statement in Java. I have changed the code to make it clear. – Aadhil Rushdy May 29 '19 at 11:47
  • @cbley I have tried using System.nanoTime and it still gives the same result. – Aadhil Rushdy May 29 '19 at 11:48
    Likely `toLocalIterator` in the first case has to go to the database and get the data from there (to make it available locally as the name says). In the second case it obviously doesn't. If you use the same `dataset.select` in Scala you should see the same result. – Alexey Romanov May 29 '19 at 12:19
  • @AlexeyRomanov changed the code in Java to make it more clear. Do you think still the same case? – Aadhil Rushdy May 30 '19 at 05:00
  • @AlexeyRomanov yes in the java code, for the RDD conversion it takes time. In the scala code, it is an RDD itself. – Aadhil Rushdy May 30 '19 at 05:47
  • @AadhilRushdy: In the Java case you read a parquet file, in Scala you create the dataframe yourself, if you want to compare the two then do the same thing. Spark is lazy and won't perform any operations before it's necessary, you do the timing only around the `toLocalIterator` in both cases but the Java will perform the additional file operations there as well. – Shaido May 30 '19 at 09:10
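The laziness point raised in the comments can be sketched without Spark at all. The plain-Java snippet below uses a memoizing Supplier to stand in for a lazy Dataset: the first "action" pays the full upstream cost (a sleep simulating the parquet scan), while later accesses are nearly free. All names here are illustrative, not Spark API; it only models why the Java timing window includes the file read while the Scala timing of a tiny in-memory RDD does not.

```java
import java.util.function.Supplier;

public class LazyTimingDemo {
    // Memoizing supplier standing in for a lazy Spark Dataset: the
    // expensive upstream work runs only when the value is first requested.
    static <T> Supplier<T> lazy(Supplier<T> expensive) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;
            @Override public synchronized T get() {
                if (!computed) {
                    value = expensive.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<int[]> dataset = lazy(() -> {
            try { Thread.sleep(500); }               // simulated parquet read
            catch (InterruptedException e) { throw new RuntimeException(e); }
            return new int[] {3053462, 3256790, 3269055};
        });

        long t0 = System.nanoTime();
        dataset.get();                               // first "action": pays the read
        long firstMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        dataset.get();                               // already materialized: cheap
        long secondMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("first action:  " + firstMs + " ms");
        System.out.println("second action: " + secondMs + " ms");
    }
}
```

To get a fair comparison in the question's code, the same idea applies: trigger an action (e.g. count()) on datasetNew before starting the timer, so the parquet scan is not billed to toLocalIterator.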

0 Answers