3

I have a view that works to pull the most recent data for a Hive history table. The history table is partitioned by day. The way that the view works is very straightforward—it has a subquery that does a max date on the date field (the one that is used as the partition) then filters the table based upon that value. The table contains hundreds of days (partitions), each with many millions of rows. In order to speed up the subquery, I am attempting to limit the partitions that are scanned to the last one created. To account for holiday weekends, I'm going back four days to ensure that the query returns data.

If I hard code the values with dates, the subquery runs very fast, and limits to the partitions correctly.

However, if I attempt to limit the partitions with a subquery to calculate the last partition, it doesn’t recognize the partitions and does a full table scan. The query will return correct results, as the filter works, but it takes a long time because it is not limiting the partitions scanned.

I tried doing the subquery as a WITH statement, then using an INNER JOIN on bus_date, but got the same results—partitions were not utilized.

The behavior is repeatable via a query, so I’ll use that rather than the view to demonstrate:

SELECT *
  FROM a.transactions
 WHERE bus_date IN (SELECT MAX (bus_date)
                      FROM a.transactions maxtrans
                     WHERE bus_date >= date_sub (CURRENT_DATE, 4));

There are no error messages, and the query actually works (filters to pull the correct data), but it scans all partitions so it is extremely slow. How can I limit the query to utilize the partitions identified in the subquery?

Jeff E
  • 31
  • 2
  • 1
    Read these related answers: https://stackoverflow.com/a/56940800/2700344 and this https://stackoverflow.com/a/56963448/2700344 and this https://stackoverflow.com/a/53279839/2700344 – leftjoin Sep 12 '19 at 18:50
  • Will give these a try. Thanks! – Jeff E Sep 13 '19 at 13:05
  • These are interesting but not exactly the same issue. It appears that Hive will not prune the partitions if the where criteria on the partition column comes from a subquery.. – Jeff E Sep 16 '19 at 19:48
  • You need to pass your subquery result as parameter as shown above in link... I did test same for spark sql. https://stackoverflow.com/questions/40663066/how-to-load-only-the-data-of-the-last-partition/57440645#57440645 – vikrant rana Sep 17 '19 at 17:38
  • Vikrant, it doesn't appear that this could be done with Hive SQL alone. I need to put this in place for a view. Is that possible? – Jeff E Sep 18 '19 at 19:00
  • I found this extremely helping: https://stackoverflow.com/questions/68465153/hive-read-table-partitions-defined-in-subselect. Under most of the time, no, you cannot prune partitions unless you are using hive on tez. – CKLu May 10 '22 at 09:15

1 Answers1

0

I'm still hopeful that someone will have an answer for this, but I did want to post the workaround that I've come up with in case it is useful for someone else.

SELECT * FROM a.transactions WHERE bus_date >= date_sub (CURRENT_DATE, 4) AND bus_date IN (SELECT MAX (bus_date) FROM a.transactions maxtrans WHERE bus_date >= date_sub (CURRENT_DATE, 4));

The query is a little clumsy, as it is filtering on the business date twice. The first time it limits the main set of data to the last four days (which limits to those partitions and avoids a scan of all partitions) and the second pins it down to the last day for which data has been loaded (via the MAX bus_date). This is far from perfect, but performs CONSIDERABLY better than the query scanning all partitions. Thanks.

Jeff E
  • 31
  • 2