I have a directory in HDFS. Every day, one processed file is placed in that directory with a date-time stamp in its file name. If I create an external table on top of that directory location, does the external table refresh itself when the new file arrives each day?
1 Answer
If you add files to a table directory or partition directory, it does not matter whether the Hive table is external or managed: the data becomes accessible to queries. You do not need any additional steps to make the data available; no refresh is necessary.
A Hive table/partition is metadata (DDL, location, statistics, access permissions, etc.) plus the data files in its location. So the data is stored in the table/partition location in HDFS.
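As a minimal sketch of the situation in the question, an external table can be declared over the directory that receives the daily file. The table name, columns, delimiter, and path below are hypothetical and only illustrate the idea:

```sql
-- Hypothetical table, columns, and path; adjust to the real file layout.
CREATE EXTERNAL TABLE IF NOT EXISTS daily_events (
  event_id STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/processed/daily_events';

-- Files dropped into the LOCATION directory later are read by the next query;
-- no DDL or refresh statement is needed in between.
SELECT * FROM daily_events LIMIT 10;
```

Because the table definition only points at the directory, each query lists whatever files are present there at the time it runs.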
Only if you create a new directory for a partition that has not been created yet do you need to execute ALTER TABLE table_name ADD PARTITION (partition_spec) LOCATION '<new location>' or MSCK REPAIR TABLE table_name. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS.
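The two ways of registering a new partition directory might look like this. The table name, the partition column load_date, and the paths are assumptions made up for the example:

```sql
-- Hypothetical partitioned table: daily_events_part, partitioned by load_date.

-- Option 1: register one new partition directory explicitly.
ALTER TABLE daily_events_part
  ADD IF NOT EXISTS PARTITION (load_date='2018-09-01')
  LOCATION '/data/processed/daily_events_part/load_date=2018-09-01';

-- Option 2: let Hive discover partition directories under the table location.
-- This expects directories named in the key=value style (load_date=2018-09-01).
MSCK REPAIR TABLE daily_events_part;
```

ADD PARTITION works with any directory layout since the location is given explicitly, while MSCK REPAIR TABLE is convenient when many partition directories were added at once.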
If you add files into already created table/partition locations, no refresh is necessary.
CBO can use statistics for query calculation without reading data files, for example for count(*). It works for simple queries only, like count(*) or max().
If you are using CBO with statistics for query calculation, you may need to refresh the statistics after new files arrive, using ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS. See this answer for more details: https://stackoverflow.com/a/39914232/2700344
If you do not need statistics and want your table location to be scanned every time you query it, switch it off: set hive.compute.query.using.stats=false;
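The two options for keeping statistics-based answers in line with newly added files can be sketched as follows, again using the hypothetical table daily_events_part and partition column load_date:

```sql
-- Option 1: recompute statistics after new files land.
-- For a single partition:
ANALYZE TABLE daily_events_part PARTITION (load_date='2018-09-01') COMPUTE STATISTICS;
-- For all partitions (partition column given without a value):
ANALYZE TABLE daily_events_part PARTITION (load_date) COMPUTE STATISTICS;

-- Option 2: for this session, do not answer queries from stored statistics,
-- so count(*) scans the files in the table location.
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM daily_events_part;
```

The first option keeps simple aggregate queries fast at the cost of an extra step after each load; the second trades that speed for results that always reflect the files currently in the location.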

Yes, thank you. I tried adding a file to that directory and the table showed the new set of rows, which confirms that I don't need to follow any other steps, as you said! – Shivpe_R Sep 01 '18 at 15:27