0

eg. If my date column is load_date, using max(load_date) operator will scan every data file in hive making it a costly operation. Instead is there any optimal way to get the latest load_date from the table.

AbhishekB
  • 31
  • 1
  • 5

2 Answers2

3

Just thinking out of Hive,Assuming that you've configured mysql for your metastore instead of derby and the partition column load_date.

Suggestion 1: This query gives you all partition name. Do sub string (day=2020-05-24) and take date part out of it and cast it to date and then get the max value.

mysql> select PART_NAME FROM PARTITIONS WHERE TBL_ID=(SELECT TBL_ID FROM TBLS WHERE TBL_NAME='PARTITIONED_TABLE');

Suggestion 2: If you dont have mysql or access issue, then get the partition list out using below command and write a shell script which reads this file and give the max load_date.

$hive -e 'show partitions table;' > partitions.txt

Suggestion 3: In Hive, the way is, just to avoid full table scan. Here "-3" is variable, change it to last successful day load if you know it.

select max(load_date) from db.table_name
where load_date>date_add(current_date,-3)

Note: Not sure how you want to consume the output of max(load_date), you can always store the value in hive table and use in your query.

0

The only way in Hive to check max date is using max aggregation function. In your case its max(load_date).

Not only Hive, in any relational database, this is the approach to find the latest date. If required in your design you can have a metadata table, where it records the statistics of each file loaded in hive, this helps you to get the latest record

halfer
  • 19,824
  • 17
  • 99
  • 186
Jim Macaulay
  • 4,709
  • 4
  • 28
  • 53