
Let's say I have a Hive table partitioned by date, with its data stored in S3 as Parquet files. Let's also assume that for a particular partition (date) there were originally 20 records.

If I then delete the original files and put new Parquet files with 50 records in the same folder, do I need to drop and recreate that partition for the new data to be reflected?

My understanding was that we don't have to recreate partitions. So I tried removing the old data from the respective folder and keeping only the new data, without "updating" the Hive partition. However, when I then ran count(*) for that date, it still showed 20 records instead of 50. After dropping and re-creating the partition, it showed the correct count. Is that the expected behavior?
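
For illustration, the setup looks roughly like this (table name, columns, and S3 paths are placeholders, not the real ones):

-- hypothetical table matching the scenario
CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';

-- what finally made the new count show up:
ALTER TABLE sales DROP IF EXISTS PARTITION (dt='2021-01-01');
ALTER TABLE sales ADD PARTITION (dt='2021-01-01')
  LOCATION 's3://my-bucket/sales/dt=2021-01-01/';

SELECT count(*) FROM sales WHERE dt='2021-01-01';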


1 Answer


Hive optimizes simple queries like select count(*) using statistics. If this property is set:

set hive.compute.query.using.stats=true;

then Hive takes the count from the statistics stored in the metastore, without scanning the data files at all.

When you replaced the files with new ones, the statistics stayed the same. When you dropped the partition, all related statistics were deleted as well, which is why you got the correct count after re-creating the partition.
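
You can inspect the stored partition statistics yourself; a quick check (table and partition names here are placeholders):

-- 'sales' and dt='2021-01-01' are example names
DESCRIBE FORMATTED sales PARTITION (dt='2021-01-01');
-- look for numRows, numFiles and totalSize under Partition Parameters;
-- after swapping files manually, numRows still shows the old value

That numRows value is what count(*) returns while hive.compute.query.using.stats is true.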

See also this answer: HIVE select count() non null returns higher value than select count(). A predicate prevents statistics usage in that case.

This behavior is expected. You can either set

set hive.compute.query.using.stats=false;

to switch off statistics usage when calculating query results. Your partition re-creation effectively did the same thing: it removed the statistics, so they could not be used and the files were scanned instead.

Or you can analyze the table to refresh the statistics and keep the above parameter set to true, so the next time you run a simple aggregation it will be fast:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]  
  COMPUTE STATISTICS
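
For example, to refresh statistics for just the affected date partition (table and partition names are placeholders):

-- example names, adjust to your table and partition column
ANALYZE TABLE sales PARTITION (dt='2021-01-01') COMPUTE STATISTICS;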

For a small file with only 50 records the performance difference is not big, but it is still better to keep statistics up to date: the optimizer also uses them to build a better query plan.

More details here: analyze table

And if you insert data using INSERT OVERWRITE, you can enable statistics auto-gathering:

set hive.stats.autogather=true;
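
With that enabled, statistics for the partition are recomputed as part of the insert itself; for example (table names are placeholders):

-- 'sales' and 'sales_staging' are example tables
INSERT OVERWRITE TABLE sales PARTITION (dt='2021-01-01')
SELECT id, amount FROM sales_staging WHERE dt='2021-01-01';
-- numRows for the overwritten partition is refreshed automatically

Note that this only helps when the data goes through Hive; files copied directly to S3 still need ANALYZE TABLE.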