
I have spark conf as:

sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")    
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

I am using the Spark context to write the parquet files to an HDFS location as:

df.write.partitionBy('asofdate').mode('append').parquet('parquet_path')

In the HDFS location, the parquet files are stored under 'asofdate' partition directories, but for the Hive table I have to run 'MSCK REPAIR TABLE <tbl_name>' every day. I am looking for a way to recover the table for every new partition from the Spark script itself (or at the time the partition is created).
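For context, the full flow looks roughly like this (the app name and input path below are illustrative placeholders, not my actual job):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative setup; the real job sets these on its own SparkConf.
sparkConf = SparkConf().setAppName("asofdate-writer")
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
sparkConf.set("hive.exec.dynamic.partition", "true")
sparkConf.set("hive.exec.dynamic.partition.mode", "nonstrict")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()

df = spark.read.parquet("input_path")  # placeholder source
df.write.partitionBy("asofdate").mode("append").parquet("parquet_path")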

gd1

1 Answer


It's better to integrate Hive with Spark to make your job easier.

Once the Hive-Spark integration is set up, you can enable Hive support while creating the SparkSession:

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()

Now you can access Hive tables from Spark, and you can run the repair command from Spark itself:

spark.sql("MSCK REPAIR TABLE <tbl_name>")

I would suggest writing the DataFrame directly as a Hive table instead of writing it to Parquet and then repairing the table:

df.write.partitionBy("<partition_column>").mode("append").format("parquet").saveAsTable("<table>")
Mohana B C
  • I have a streaming service that generates more than 100 parquet files every hour, so running repair table every hour might be an overhead. And I also need to save the parquet files to the HDFS location. – gd1 Aug 17 '21 at 17:41
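If a full MSCK REPAIR every hour is too heavy, a lighter sketch (table name, partition value, and location are placeholders) is to register only the partition that was just written, right after the write, using the Hive-enabled session from above:

asofdate = "2021-08-17"  # placeholder partition value for the batch just written
spark.sql(
    f"ALTER TABLE my_db.my_table ADD IF NOT EXISTS "
    f"PARTITION (asofdate='{asofdate}') "
    f"LOCATION 'parquet_path/asofdate={asofdate}'"
)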