1

Problem statement :- I have a original external table with table count(1000) by copying its underlying data to some temp location and when created backup table pointing to that temp location. And after running the msck repair the both table counts are not matching?

Is there any reason for it. Could you please help me in understanding the reason behind it .

Never_Give_Up
  • 126
  • 1
  • 9
  • 1
    How many records are mismatching?? What is the DDL used? – saravanatn Sep 20 '20 at 05:41
  • I used the same DDL taken from the show create table name and count difference is ~600 more records. – Never_Give_Up Sep 20 '20 at 09:56
  • 1
    It can be caused by using wrong statistics: https://stackoverflow.com/a/41021682/2700344 – leftjoin Sep 20 '20 at 13:45
  • 1
    @leftjoin .. Even though the counts are different initially ,After processing the counts matched ideally even though counts not matched initially there is no issue. It make sense to an extent. Thank you. – Never_Give_Up Sep 20 '20 at 14:23

1 Answers1

1

Answering and clarifying few things here,

  1. Stats can be fetched either directly from Metastore or by reading through the underlying data. It can be controlled by the property hive.compute.query.using.stats

    a. When it is set to TRUE, Hive will answer a few queries like min, max, and count(1) purely using statistics stored in the metastore.

    b. When it is set to FALSE, Hive will spawn a YARN job to read through the data and provide the count results. It is usually time consuming based on the amount of data since this is not a direct fetch from the statistics stored in Hive Metastore.

    So, if we want the correct statistics to be returned in the results when the property hive.compute.query.using.stats is set to TRUE, we need to make sure the statistics for the table is updated.

    You can check if the value is set to TRUE or FALSE by running the below in Hive,

    SET hive.compute.query.using.stats;

  2. MSCK REPAIR does not do the file level checks. It looks only for directory level changes, for example if you have created a partitioned table and added a partition directory manually in HDFS and if you want Hive to be aware of it, MSCK REPAIR would serve the purpose.

Gomz
  • 850
  • 7
  • 17
  • /@leftjoin ... Can i know why insert overwrite is not overwriting the NULL values with int type. Any idea how to solve this issue.. – Never_Give_Up Sep 22 '20 at 13:09