3

When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)

For example:

SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6

Rows from table a which have some_name = 6 will be filtered before performing the join, if push down predicates are enabled(hive.optimize.ppd).

However, I have also learned recently that there is another way of filtering data from a table before joining it with another table(https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).

One can provide the condition in the ON clause, and table a will be filtered before the join is performed

For example:

SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6

Do both of these provide the predicate pushdown optimization?

Thank you

leftjoin
  • 36,950
  • 8
  • 57
  • 116
vi_ral
  • 369
  • 4
  • 19

1 Answers1

4

Both are valid and in case of INNER JOIN and PPD both will work the same. But these methods works differently in case of OUTER JOINS

ON join condition works before join.

WHERE is applied after join.

Optimizer decides is Predicate push-down applicable or not and it may work, but in case of LEFT JOIN for example with WHERE filter on right table, the WHERE filter

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 --Right table filter

will restrict NULLs, and LEFT JOIN will be transformed into INNER JOIN, because if b.some_name=6, it cannot be NULL.

And PPD does not change this behavior.

You can still do LEFT JOIN with WHERE filter if you add additional OR condition allowing NULLs in the right table:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records

And if you have multiple joins with many such filtering conditions the logic like this makes your query difficult to understand and error prune.

LEFT JOIN with ON filter does not require additional OR condition because it filters right table before join, this query works as expected and easy to understand:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6

PPD still works for ON filter and if table b is ORC, PPD will push the predicate to the lowest possible level to the ORC reader and will use built-in ORC indexes for filtering on three levels: rows, stripes and files.

More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344

So, PPD or not PPD, better use explicit ANSI syntax with ON condition and ON filtering if possible to keep the query as simple as possible and avoid converting to INNER JOIN unintentionally.

leftjoin
  • 36,950
  • 8
  • 57
  • 116