I am trying to convert a Dataset to an Iterator in a Java program using the toLocalIterator method. The call takes 1000+ ms, which is far more than the same conversion takes in Scala.
I tried converting a Dataset of size 3 in both Java and Scala. The elapsed time was around 1000 ms in Java and 6 ms in Scala.
//In Java
Dataset<Row> dataset = sparkSession.read().parquet(parquetPath);
Dataset<Row> datasetNew = dataset.select("col1"); // outputs "3053462", "3256790", "3269055"
long st_od = System.currentTimeMillis();
Iterator<Row> iterator = datasetNew.toLocalIterator();
long et_od = System.currentTimeMillis();
logger.info("Elapsed time for iterator conversion: " + (et_od - st_od)
+ "ms");
//In Scala
val data = List("3053462", "3256790", "3269055")
val rdd = spark.sparkContext.parallelize(data) // note: this is an RDD[String], not a Dataset
val st = System.currentTimeMillis()
val iter = rdd.toLocalIterator
val et = System.currentTimeMillis()
println("Time: " + (et - st) + "ms")
I expected the Java code to take around 6 ms as well, but the toLocalIterator call costs 1000+ ms. Does anyone know the reason?
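In case the setup difference matters: the Scala snippet calls toLocalIterator on an RDD built with parallelize, while the Java code calls it on a Dataset read from parquet. A closer Java equivalent that I could try (again, the class name and local[*] master are only for this test) would build the same three values as an in-memory Dataset:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SmallDatasetTiming {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("small dataset toLocalIterator")
                .master("local[*]") // local mode, only for this test
                .getOrCreate();

        // The same three values as in the Scala snippet, but as a Dataset<String>.
        Dataset<String> ds = spark.createDataset(
                Arrays.asList("3053462", "3256790", "3269055"), Encoders.STRING());

        long start = System.currentTimeMillis();
        Iterator<String> it = ds.toLocalIterator();
        long end = System.currentTimeMillis();

        System.out.println("Elapsed time: " + (end - start) + "ms");
        spark.stop();
    }
}

If this in-memory Dataset is also fast in Java, that would point at the parquet-backed Dataset rather than at Java itself.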