
When I write data from a DataFrame into a partitioned Parquet table, the process gets stuck updating partition stats even after all the tasks have completed successfully.

16/10/05 03:46:13 WARN log: Updating partition stats fast for: 
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for: 
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for: 



df.write.format("parquet").mode("overwrite").partitionBy("part1").insertInto("db.tbl")

My table has > 400 columns and > 1000 partitions. Please let me know if we can optimize and speed up updating the partition stats.

Karthik

1 Answer


I think the problem here is that there are too many partitions for a table with > 400 columns. Every time you overwrite a table in Hive, the statistics are updated. In your case it will try to update the statistics for 1000+ partitions, and each partition holds data with > 400 columns.

Try reducing the number of partitions (use another partition column, or if it is a date column, consider partitioning by month) and you should see a significant improvement in performance.
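
To make that concrete, here is a minimal PySpark sketch (not from the original post) of the "partition by month" idea. The column and table names are hypothetical: it assumes the table is currently partitioned by a daily date column called event_date, derives a coarser event_month column, and partitions by that instead, so the metastore only has to update roughly one partition per month rather than one per day.

from pyspark.sql import functions as F

# Hypothetical names: `event_date` stands in for the current daily partition
# column, `db.tbl_by_month` for a new table partitioned by month.
df_monthly = df.withColumn("event_month", F.date_format("event_date", "yyyy-MM"))

(df_monthly.write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("event_month")       # ~12 partitions per year instead of ~365
    .saveAsTable("db.tbl_by_month"))

saveAsTable is used here so the new partition layout is created in one step; if the target table already exists with a monthly partition column, insertInto works the same way.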

  • I have the same issue, and I'm not specifying any partition column when writing to disk. I didn't understand what stats are being calculated. Could you explain more? – HHH Feb 03 '17 at 15:27