2

I am running a query in Hive, which is like below, and have OR condition in the left join. When I run the select, it throws me couple of error messages.

  1. OR not supported currently in JOIN ( got to know OR works only for equi joins in Hive)

  2. Both left and right aliases encountered in JOIN 'cre_timestamp'

           a.line_id,
           a.seller,
           a.sellerid,
           a.sellername,
           a.item_no,
           a.item_cd,
           a.cre_timestamp
     from Table A
     left join Table B
     on translate(a.id,'0','') = translate(b.id,'0','')
     or translate(a.seller,'Z','') = translate(b.seller,'Z','')
     or (a.item_no=b.item_no and a.item_no is not null and a.item_cd is not null and a.item_no <> '' and a.item_cd <> '')
     left join ( select id, line_id,cre_timestamp from table x) C
     on a.id=c.id
     and a.cre_timestamp < c.cre_timestamp
     and a.cre_timestamp > date_sub(c.cre_timestamp,21)
     and translate(a.id,'0','') = translate(b.id,'0','') or a.item_cd = b.item_cd
    where a.seller is null

How can we overcome this issue?

#For 1: One way, I can try to write the query is, using a UNION, copy past the query 3 times, for the OR conditions.

#For 2:

If I cut the

and a.cre_timestamp < c.cre_timestamp
     and a.cre_timestamp > date_sub(c.cre_timestamp,21)

and bring it into the where clause at the bottom, it works fine. (want to understand why it doesn't work in a join)

On a whole, looking for a better approach, which will not affect the run-time and more optimized query, as if I change it to use a UNION, it has to process the same query 3 times which will affect the query.

Thank you for taking time to look into this.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
jahan
  • 103
  • 4
  • 19

1 Answers1

0

I already tried to explain why non-equi (theta) joins do not work in map-reduce framework in this post, will not duplicate it here, please read: Why Hive can not support non-equi join

Now what happens if you move non-equality join condition to the where clause: The join will work using equality condition only and will probably produce some duplication because it cam be many to many join. These duplicates will be filtered by WHERE condition. In the worst case if you do not have equality condition at all, CROSS JOIN will be performed, which is also easy to implement using MapReduce framework, and after that you can filter rows in the where. Filtering also is easy to implement.

This is the only way how Theta-join can be implemented in Hive currently: using duplicating join on partial equality condition (or even CROSS JOIN) plus filtering, this approach can have significant negative impact on performance. However if one of the tables is small enough to fit in memory, you can compensate negative impact on performance using map-join:

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1000000000; --small table size which can fit in memory, 
                                                 --adjust it and check Map Join operator in the plan

Also (this is not related to the question) you have redundant conditions in your query:

(a.item_no=b.item_no and a.item_no is not null and a.item_cd is not null and a.item_no <> '' and a.item_cd <> '')

a.item_no is not null - does nothing useful here because 1) this column already used in equality join condition and NULLs are not joined, 2) there is another condition a.item_no <> '' which excludes NULLs because if value does not equal empty string, it cannot be NULL as well, NULLs can not be equal or not equal to something.

The same redundant condition a.item_cd is not null because you already have a.item_cd <> '', this does not allows NULLs.

So, the whole condition can be reduced to this one:

(a.item_no=b.item_no and a.item_no <> '' and a.item_cd <> '')

And yes, splitting query into two or more + UNION is a common approach to solve the problem with OR join conditions. If you have some common filters, you can use WITH subquery to compensate scanning the whole table many times. Splitting the dataset using different filters and join conditions + UNION or UNION ALL also helps with skewed join keys. If you are using Tez, using WITH subquery will allow to read the table once (on mapper) and all other vertices will read the same results prepared by mapper, eliminating writing of intermediate results into persistence storage each time.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thank you for the detailed explanation.I will be testing the above conditions and see how it is going to work. – jahan Jan 20 '21 at 18:39