1

Recently I came across this issue. I had a file at a Hadoop Distributed File System path and related hive table. The table had 30 partitions on both sides.

I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the hive table. It completed fine but outputted

"Partitions missing from filesystem:"

I tried running select count(*) <db.tablename>; (on tez) it failed with the following error:

Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException:

But when I set hive.execution.engine as "mr" and executed "select count(*) <db.tablename>;" it worked fine without any issue.

I have two questions now :

  1. How is this is possible?

  2. How can I sync the hive metastore and an hdfs partition? For the above case .(My hive version is " Hive 1.2.1000.2.6.5.0-292 ".)

Thanks in advance for help.

jwezorek
  • 8,592
  • 1
  • 29
  • 46

1 Answers1

0

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

This will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.

However, this is available only from Hive version 3.0.. See - HIVE-17824

In your case, the version is Hive 1.2, below are the steps to sync the HDFS Partitions and Table Partitions in Metastore.

  1. Drop the corresponding 5 partitions those have been removed by you from HDFS directly, using the below ALTER statement .

ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);

  1. Run SHOW PARTITIONS <table_name>; and see if the list of partitions are refreshed.

This should sync the partitions in HMS as in HDFS.

Alternatively, you can drop and recreate the table (IF it is an EXTERNAL table), perform MSCK REPAIR on the newly created table. Because dropping an external table will not delete the underlying data.

Note: By default, MSCK REPAIR will only add newly added partitions in HDFS to Hive Metastore and does not delete the Partitions from Hive Metastore those have been deleted in HDFS manually.

====

To avoid these steps in future, it is good to delete the partitions directly using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>) from Hive.

Gomz
  • 850
  • 7
  • 17
  • I am trying to write a script to get partition from HDFS and Hive Metastore , compare them find the difference and then trigger a script that will take each partitions from the the difference file and delete accordingly. But only blocker I am facing is I have to schedule this script every day . There can be many such tables with different databases. How can I work/ resolve this . Any suggestion will be appreciated . – Kamal Prasad Sep 03 '19 at 07:14
  • I believe that this is supposed to be a different question as a separate post. – Gomz Sep 03 '19 at 07:20
  • Actually the main question lead to this part question . So basically we cant upgrade to hive version 3. But we have a requirement to keep Hdfs filesystem and Hive metastore in sync . For that I am writing the script , need some help for forming this . – Kamal Prasad Sep 03 '19 at 13:34