
I was running Spark SQL on YARN and hit the same issue as described in this link: Spark: long delay between jobs

There's a long delay after the action that saves the table. On the Spark UI, I could see that the particular saveAsTable() job had completed, but no new job was submitted. spark ui screenshot

In the first link, the answer said that I/O operations occur on the master node, but I doubt that.

During the gap, I checked the HDFS location where I was saving the tables, and I could see a _temporary directory rather than a _SUCCESS file. It looks like that answer is true and Spark was saving the table on the driver end. Why?!

I'm using below code to save table:

dataframe.write.partitionBy(partitionColumn)
  .format(format)
  .mode(SaveMode.Overwrite)
  .saveAsTable(s"$tableName")

BTW, the format is ORC. Can anyone give me some suggestions? :) Thanks in advance.
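One common way to keep the number of output files down (a sketch, not a confirmed fix for this case; it assumes the same `dataframe`, `partitionColumn`, `format`, and `tableName` as above, plus a `col` import) is to repartition by the partition column before writing, so that all rows for a given output partition are written by a single task instead of one file per shuffle task per partition value:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Repartition by the partition column so rows for the same output
// partition land in the same task; this bounds the file count per
// partition directory to roughly one file per task that holds data.
dataframe
  .repartition(col(partitionColumn))
  .write
  .partitionBy(partitionColumn)
  .format(format)
  .mode(SaveMode.Overwrite)
  .saveAsTable(s"$tableName")
```

The trade-off is an extra shuffle before the write, but far fewer small files for the commit phase to move out of _temporary.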

Sailendra
XiBaibai
  • Not sure if my change that increased spark.sql.shuffle.partitions to 2000 (the default is 200) caused this issue. – XiBaibai Nov 01 '19 at 08:03
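For reference, the setting mentioned in the comment can be changed at runtime (a minimal sketch, assuming a SparkSession named `spark`):

```scala
// spark.sql.shuffle.partitions controls how many tasks a shuffle
// produces (default 200). Raising it to 2000 also multiplies the
// number of writing tasks, and hence output files, when a write
// with partitionBy follows a shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```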

1 Answer


Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy? As per the link above, partitionBy partitions the data on disk, so this process cannot be monitored on the Spark UI. I had increased the number of partitions before calling partitionBy(), so too many files were generated, and that caused the delay. I think that's the cause.
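A rough upper-bound sketch of why the file count explodes (the cardinality below is hypothetical; the actual count depends on how rows are distributed across tasks):

```scala
// Worst case: every shuffle task holds rows for every partition value,
// so each task writes one file per value under each partition directory.
val shufflePartitions       = 2000 // spark.sql.shuffle.partitions
val distinctPartitionValues = 50   // hypothetical cardinality of partitionColumn
val maxOutputFiles          = shufflePartitions * distinctPartitionValues
// up to 100000 small files that the commit phase must move out of _temporary,
// one rename at a time, which is the serial driver-side work seen in the gap
```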

XiBaibai