left outer join results become bigger on hive

Question

All, I want to get reliable results from this query:

SELECT ..

FROM
(
SELECT 
 CO_CODE,
REP.cua cua,
PRD.PRODUCT_DESC, 
REGEXP_EXTRACT(B.rfbbn,'^(?:[^*]*\\*){2}([^*]*)',1) cllt,
NVL(CCY_bbce,0) bbce, 
B.TYPE, 
A.conn_keyy

FROM 
(
SELECT conn_keyy , ext_date FROM 
(tablee.aa) A
)aaxyz 
WHERE flag = 'Y'
)A

LEFT OUTER JOIN 
tablee.B
ON A.conn_keyy = B.conn_keyy

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.cc) CPLCUR
ON CPLCUR.conn_keyy = A.conn_keyy
AND CPLCUR.cllt = REGEXP_EXTRACT(B.rfbbn,'^(?:[^*]*\\*){2}([^*]*)',1)
AND CPLCUR.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.dd) CPLBAL
ON CPLBAL.conn_keyy = A.conn_keyy
AND CPLBAL.SEQUENCE = CPLCUR.SEQUENCE
AND CPLBAL.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.ee) CPLCCY
ON CPLCCY.conn_keyy = A.conn_keyy
AND CPLCCY.SEQUENCE = CPLCUR.SEQUENCE
AND CPLCCY.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.ff) CPLMOV
ON CPLMOV.conn_keyy = A.conn_keyy
AND CPLMOV.SEQUENCE = CPLCUR.SEQUENCE
AND CPLMOV.dtdt = '1999' 

LEFT OUTER JOIN
 (tablee.REP)REP 
ON REP.relino = B.lnido

LEFT OUTER JOIN tablee.P PRD
ON PRD.PRODUCT_CODE = REGEXP_EXTRACT(A.conn_keyy,'[.]([^.]+)',1)
AND PRD.dtdt = '1999'

WHERE B.lnido LIKE 'PLCONS1%'
) rrvv;

fyi, the select count (*) of A is around 60,000

I just wondering why my query results turns 1.5 billions.. What did I miss? What went wrong when i operate the left-outer-join?

First of all LEFT JOIN on table B is not really LEFT JOIN but it is INNER JOIN as you have used `B.lnido LIKE 'PLCONS1%'` in WHERE clause. Next, try to remove one by one LEFT JOIN to check if count of the rows are reduced as It seems that there is 1 to many relationship which is multiplied in the resultant records. — Popeye, Jan 12 '21 at 06:17
if i may ask, how many column it should produce on case like this? fyi, when I select count (*) from B the results is around 350,000.. Or should I specify the distinct columns? https://sqlandhadoop.com/hive-distinct/ — thecardcaptor, Jan 12 '21 at 06:21
Please do what I say in my 1st comment [here](https://stackoverflow.com/q/55225131/3404097) including chopping & simplifying your code to the first subexpression that does not return what you expect & say what you expect & why you expect that, with justification per authoritative documentation. [left join](https://stackoverflow.com/a/46091641/3404097) [mre] Otherwise you're just asking for yet another manual to be written with a bespoke tutorial & no clues about where your misconceptions lie. PS Please format your code reasonably. PS Your question when pinned down will be an easily found faq. — philipxy, Jan 13 '21 at 18:24

leftjoin · Answer 1 · 2021-01-12T12:11:40.347

1

Join can duplicate rows if the join key is not unique in second table and if join key is not unique in both tables, it will produce much more duplicates.

For example:

with 

A as (
select 1 key, 'one' name
union all
select 1 key, 'two' name
),

B as (
select 1 key, 'one' name
union all
select 1 key, 'two' name
)

select *
  from A left join B on A.key=B.key

Result in four rows and each table contain only two:

a.key   a.name  b.key   b.name
1       one     1       one
1       one     1       two
1       two     1       one
1       two     1       two

How to find key with duplicates:

select B.conn_keyy, count(*) cnt 
  from tablee.B
group by B.conn_keyy
 having count(*)>1
order by cnt desc limit 100;

Check every table you are joining with and decide what you can do: apply filtering, distinct or add more join keys to make join (one to one or zero) or (many to one or zero)

edited Jan 12 '21 at 12:11

answered Jan 12 '21 at 08:42

leftjoin

36,950
8
57
116

sorry, edited. the A querys contains the unique values, it gives 0 results. On the other hand, the gives many results (not unique conn_keyy). I think that I don't perform the left-outer-join appropriately. what do you think? https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=27362039#content/view/27362039 – thecardcaptor Jan 12 '21 at 10:53
@thecardcaptor Maybe you should use additional join key. (complex join key). Maybe you need to remove duplicates before join. This is nothing to do with Hive documentation or Joins syntax in general. Check join keys. if it is not unique, after first join it will be duplicates, second join will produce even more dups. If join key is not unique - join produces duplicates, it is NORMAL join behavior. just study your data and chose correct join key (it can be complex key, consisting of many columns) – leftjoin Jan 12 '21 at 11:08
I already trying with combination coalesce (),It stopped when I trying to join the other conn_key based on A.. https://stackoverflow.com/questions/65705636/left-outer-join-with-subqueries-in-exist-at-hive – thecardcaptor Jan 13 '21 at 16:12

left outer join results become bigger on hive

1 Answers1

Linked