
I'm running a Scala application in IntelliJ (Spark 2.3, HDP 2.6.5).

I'm trying to read a Parquet file from HDFS and run a map operation on it, but it takes too long.

I have noticed that when my initial DataFrame is big, the map operation takes too long even if I shrink the DataFrame first.

Please look at the following code sample:

import java.time.LocalDateTime

import org.apache.spark.sql.{DataFrame, Row}

def main(args: Array[String]): Unit = {
  ...
  import sparkSession.implicits._ // provides the Encoder[String] that map needs

  // first part - runs fast
  println("Start: " + LocalDateTime.now())
  val smallDf: DataFrame = sparkSession.read.parquet(hdfsSmallParquetPath) // the small parquet returns 3000 rows
  val collectedData1 = smallDf.map(runSomeMethod).collect()
  println("End: " + LocalDateTime.now())

  // second part - runs slow
  println("Start: " + LocalDateTime.now())
  val bigDf: DataFrame = sparkSession.read.parquet(hdfsBigParquetPath) // the big parquet returns 3,000,000 rows
  val smallerDf: DataFrame = bigDf.sample(0.001) // shrink it to return 3000 rows
  val collectedData2 = smallerDf.map(runSomeMethod).collect()
  println("End: " + LocalDateTime.now())
}

def runSomeMethod(r: Row): String = {
  "abcd"
}

The first part runs on 3000 rows and takes ~1 second; the second part also runs on 3000 rows but takes ~150 seconds.

How can I make the second part run as fast as the first part?

Is there a cache()/persist() call that can improve the performance?
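For example, would something like the following help? This is just a sketch of what I mean: persist the shrunk DataFrame and force it to materialize once, so the full scan of bigDf only happens a single time:

val smallerDf: DataFrame = bigDf.sample(0.001).cache()
smallerDf.count() // forces materialization; the full scan of bigDf happens here, once
val collectedData2 = smallerDf.map(runSomeMethod).collect() // should now operate only on the cached 3000 rows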

Is there any difference between running on a small DataFrame and on a big DataFrame that became small?
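And would replacing sample with limit help, assuming a deterministic slice of the data is acceptable instead of a random sample? A sketch of that variant (my understanding is that limit may let Spark stop reading early instead of scanning the whole file, depending on the plan):

val firstRowsDf: DataFrame = bigDf.limit(3000) // deterministic slice instead of a random sample
val collectedData3 = firstRowsDf.map(runSomeMethod).collect()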

Thanks

Nir
  • It's not an exact duplicate, but bottom line: *Spark is lazy*. – eliasah Mar 20 '19 at 15:14
  • You can also check the following answers: https://stackoverflow.com/questions/35356372/spark-is-taking-too-much-time-and-creating-thousands-of-jobs-for-some-tasks/35359670#35359670 and https://stackoverflow.com/questions/33187145/pyspark-with-elasticsearch/33190022#33190022 – eliasah Mar 20 '19 at 15:15
  • Thanks @eliasah. Should I expect different performance when running on a small DataFrame versus a big DataFrame that became small (with the same size as the first DF)? – Nir Mar 20 '19 at 15:37
  • everything depends on whether your data source enables pushdown predicates or not, etc. But anyhow, sampling isn't as "cheap" as you might think. More on sampling [here](https://stackoverflow.com/questions/50004006/spark-sample-is-too-slow). – eliasah Mar 20 '19 at 15:39

0 Answers