3

While inserting from Hive table to HIve table, It is loading more records that actual records. Can anyone help in this weird behaviour of Hive ?

My query would be looking like this:

insert overwrite table_a
    select col1,col2,col3,... from table_b;

My table_b consists of 6405465 records.

After inserting from table_b to table_a, i found total records in table_a are 6406565.

Can any one please help here ?

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • are they stored differently? maybe you have `table_a` line delimiters in some of the data – Alex Libov Dec 04 '16 at 13:46
  • Yes, table_b is normal text table. where as table_a is partitioned text table. And both the tables are having line delimeted by '\n', feilds delimeted by '\t' – Rajesh Kumar Dec 04 '16 at 15:34
  • 1
    Also see this: http://stackoverflow.com/a/39914232/2700344 Check if you are using stats for count(*) computation – leftjoin Dec 05 '16 at 08:45
  • 1
    After making hive.compute.query.using.stats=false it is showing the same count of records. But can you explain why it is happened before with ture option ? – Rajesh Kumar Dec 05 '16 at 16:15
  • Explained. See my answer please. – leftjoin Dec 07 '16 at 18:42

1 Answers1

3

If hive.compute.query.using.stats=true; then optimizer is using statistics for query calculation instead of querying table data. This is much faster because metastore is a fast database like MySQL and does not require map-reduce. But statistics can be not fresh (stale) if the table was loaded not using INSERT OVERWRITE or configuration parameter hive.stats.autogather responsible for statistics auto gathering was set to false. Also statistics will be not fresh after loading files or after using third-party tools. It's because files was never analyzed, statistics in metastore is not fresh, if you have put new files, nobody knows about how the data was changed. Also after sqoop loading, etc. So, it's a good practice to gather statistics for table or partition after loading using 'ANALYZE TABLE ... COMPUTE STATISTICS'.

In case it's impossible to gather statistics automatically (works for INSERT OVERWRITE) or by running ANALYZE statement then better to switch off hive.compute.query.using.stats parameter. Hive will query data instead of using statistics.

See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive

leftjoin
  • 36,950
  • 8
  • 57
  • 116