I'm running a Scala application in IntelliJ (Spark 2.3, HDP 2.6.5).
I'm trying to read a Parquet file from HDFS and run a map operation on it, but it takes too long.
I have noticed that when my initial DataFrame is big, the whole map operation takes too long, even if I shrink the DataFrame first.
Please look at the following code sample:
def main(args: Array[String]): Unit = {
  ...
  import sparkSession.implicits._ // needed for the implicit Encoder used by map

  // first part - runs fast
  println("Start: " + LocalDateTime.now())
  val smallDf: DataFrame = sparkSession.read.parquet(hdfsSmallParquetPath) // the small parquet returns 3000 rows
  val collectedData1 = smallDf.map(runSomeMethod).collect()
  println("End: " + LocalDateTime.now())

  // second part - runs slow
  println("Start: " + LocalDateTime.now())
  val bigDf: DataFrame = sparkSession.read.parquet(hdfsBigParquetPath) // the big parquet returns 3000000 rows
  val smallerDf: DataFrame = bigDf.sample(0.001) // shrink it to ~3000 rows
  val collectedData2 = smallerDf.map(runSomeMethod).collect()
  println("End: " + LocalDateTime.now())
}

def runSomeMethod(r: Row): String = {
  "abcd"
}
The first part runs on 3000 rows and takes ~1 second; the second part also runs on 3000 rows but takes ~150 seconds.
How can I make the second part run as fast as the first part?
Is there any cache()/persist() call that can improve the performance?
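For example, would something along these lines help? This is only a sketch of what I mean by caching the sampled DataFrame (same sparkSession, hdfsBigParquetPath and runSomeMethod as above):

import sparkSession.implicits._

// sketch: mark the shrunk DataFrame for caching and force it to materialize once,
// hoping the map afterwards reads only the cached ~3000 rows
val sampledDf = sparkSession.read.parquet(hdfsBigParquetPath).sample(0.001).cache()
sampledDf.count() // action that materializes the cache
val collectedData = sampledDf.map(runSomeMethod).collect()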
Is there any difference between running on a DataFrame that is small to begin with and on a big DataFrame that has been shrunk?
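Or would I need to force the sampled rows into fewer partitions so the DataFrame behaves like a genuinely small one? A rough sketch of what I have in mind (the partition count of 1 is just a guess):

// sketch: shuffle the sampled rows into a single partition, so the map stage
// runs over one small partition instead of one task per partition of the big parquet
val sampledDf = sparkSession.read.parquet(hdfsBigParquetPath).sample(0.001).repartition(1)
val collectedData = sampledDf.map(runSomeMethod).collect()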
Thanks