eg. If my date column is load_date, using max(load_date) operator will scan every data file in hive making it a costly operation. Instead is there any optimal way to get the latest load_date from the table.
2 Answers
Just thinking out of Hive,Assuming that you've configured mysql for your metastore instead of derby and the partition column load_date.
Suggestion 1: This query gives you all partition name. Do sub string (day=2020-05-24) and take date part out of it and cast it to date and then get the max value.
mysql> select PART_NAME FROM PARTITIONS WHERE TBL_ID=(SELECT TBL_ID FROM TBLS WHERE TBL_NAME='PARTITIONED_TABLE');
Suggestion 2: If you dont have mysql or access issue, then get the partition list out using below command and write a shell script which reads this file and give the max load_date.
$hive -e 'show partitions table;' > partitions.txt
Suggestion 3: In Hive, the way is, just to avoid full table scan. Here "-3" is variable, change it to last successful day load if you know it.
select max(load_date) from db.table_name
where load_date>date_add(current_date,-3)
Note: Not sure how you want to consume the output of max(load_date), you can always store the value in hive table and use in your query.

- 374
- 2
- 8
The only way in Hive to check max date is using max aggregation function. In your case its max(load_date)
.
Not only Hive, in any relational database, this is the approach to find the latest date. If required in your design you can have a metadata table, where it records the statistics of each file loaded in hive, this helps you to get the latest record

- 19,824
- 17
- 99
- 186

- 4,709
- 4
- 28
- 53