
I have spark conf as:

sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")    
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

I am using the Spark context to write the parquet files to an HDFS location as:

df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')

In the HDFS location, the parquet files are stored under 'asofdate' partition directories, but for the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to recover the table for every new partition from the Spark script itself (or at the time the partition is created).
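For context, the full flow looks roughly like this (the app name and input path below are illustrative placeholders, not my actual job):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative setup; the real job sets these on its own SparkConf.
sparkConf = SparkConf().setAppName("asofdate-writer")
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

df = spark.read.parquet("input_path")  # placeholder source
df.write.partitionBy("asofdate").mode("append").parquet("parquet_path")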

gd1

1 Answer


It's better to integrate Hive with Spark to make your job easier.

Once the Hive-Spark integration is set up, you can enable Hive support while creating the SparkSession:

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Now you can access Hive tables from Spark, and you can run the repair command from Spark itself:

spark.sql("MSCK REPAIR TABLE <tbl_name>")

I would suggest writing the DataFrame directly as a Hive table instead of writing it to Parquet and then repairing the table:

df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")
Mohana B C
  • I have a streaming service that generates more than 100 parquet files every hour, so running repair table every hour might be an overhead. And I also need to save the parquet files to the HDFS location. – gd1 Aug 17 '21 at 17:41
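If a full MSCK REPAIR every hour is too heavy, a lighter sketch (table name, partition value, and location are placeholders) is to register only the partition that was just written, right after the write, using the Hive-enabled session from above:

asofdate = "2021-08-17"  # placeholder partition value for the batch just written
spark.sql(
    f"ALTER TABLE my_db.my_table ADD IF NOT EXISTS "
    f"PARTITION (asofdate='{asofdate}') "
    f"LOCATION 'parquet_path/asofdate={asofdate}'"
)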