Hive: LEFT JOIN vs JOIN gives different results with filter in ON clause

Question

Suppose two tables:

    table1.c1   table1.c2
1   1           A
2   1           B
3   1           C
4   2           A
5   2           B

and

    table2.c1   table2.c2
1   2           A
2   2           D
3   3           A
4   3           B

When I do:

select distinct t1.c1, t2.c2
from
schema.table1 t1
join
schema.table2 t2
on (t1.c2 = t2.c2 
    and t1.c1 = t2.c1
    and t1.c1 = 2)

in Hive, I get:

    t1.c1   t2.c2
1   2   A

This is the expected result, no problem. But, when I do:

select distinct t1.c1, t2.c2
from
schema.table1 t1
left join
schema.table2 t2
on (t1.c2 = t2.c2 
    and t1.c1 = t2.c1
    and t1.c1 = 2)

I get:

    t1.c1   t2.c2
1   1       NULL
2   2       NULL
3   2       A

So, filter in ON clause seems not to work like I had expected. The filters t1.c1 = t2.c1 and t1.c1 = 2 hasn't been applied when, in the LEFT JOIN, it doesn't find the key on the second table so t2.c2 is NULL.

I suppose that the answer must be in doc (May be in the 'Joins occur BEFORE WHERE CLAUSES' section?) But still I don't understand the difference.

How is the process to give different results?

All rows are returned **FROM table1** when you use a left join even if there is no match in table2. But an inner join REQUIRES A MATCH from both tables, so you get less rows from the first query and more rows in the second query. This behaviour is not specific to HIVE by the way. — Paul Maxwell, Sep 01 '16 at 12:24
@piet.t gives me the answer in his second comment: LEFT JOIN never filters any rows from the left table because of the ON-condition. I did know that LEFT JOIN returns all rows of table 1 but I didn't know that it's not possible to filter that table in ON clause. — Amanda, Sep 01 '16 at 12:50

piet.t · Accepted Answer · 2016-09-01T12:48:09.263

6

It's just the way LEFT (OUTER) JOIN works:

You specify some matching-condition in your ON-clause. If there is a matching row found in the "right" table it is joined to the one in the "left" table. If there is no matching row it will still return the "left" row together with all fields from the "right" table set to null. So it will never filter any rows from the "left" table based on the ON-condition. Using the Hive-documentation's terminology: the left table is a "preserved row table" while the right table is a "null supplying table".

This is opposed to an INNER JOIN that returns only rows which have a matching partner in the other table. So there isn't a "preserved table" and there is no need for a "null supplying table"

edited Sep 01 '16 at 12:48

answered Sep 01 '16 at 12:20

piet.t

11,718
21
43
52

Thanks for the answer, I know it but, Why in the LEFT JOIN case the filters `t1.c1 = t2.c1` and `t1.c1 = 2` are not being applied? If you analyze the output the first two rows don't match neither. – Amanda Sep 01 '16 at 12:32
2

That's what I said: there is no row that is matching the `ON`-condition, so all "right" fields are set to null. Keep in mind that a `LEFT JOIN` never filters *any* rows from the left table because of the `ON`-condition. – piet.t Sep 01 '16 at 12:36
OK! Your 'a LEFT JOIN never filters any rows from the left table because of the ON-condition' is exactly the answer for me. Is it because the left table is a preserved row table? Many Thanks! – Amanda Sep 01 '16 at 12:42
Done. Thanks again. – Amanda Sep 01 '16 at 12:56
@Amamnda: sorry for being too brief, but I meant *I* added it as an edit to *my answer* - questions should not contain an answer. I already removed your edit.... – piet.t Sep 01 '16 at 13:01
sorry if i impolite, how about many left-outer-joins at the same time like this? should i use inner-join from very start? https://stackoverflow.com/questions/65678744/left-outer-join-results-become-bigger-on-hive?noredirect=1#comment116123320_65678744 – thecardcaptor Jan 12 '21 at 07:44

score 0 · Answer 2 · answered Sep 01 '16 at 12:27

0

LEFT JOIN is supposed to different output than FULL JOIN.

the output of LEFT join will contains all the data from left table (written first among the two) and if there is no corresponding data for that in right table, NULL values will be displayed. If you remove distinct from your query and run it, the output should clear your confusion on how LEFT/RIGHT joins work.

Full Join output

t1.c1   t1.c2   t2.c2
2       a       a
2       a       d
2       b       a
2       b       d

Left Join output

t1.c1   t1.c2   t2.c2
 1      a       null
 1      b       null
 1      c       null
 2      a       a
 2      a       d
 2      b       a
 2      b       d

answered Sep 01 '16 at 12:27

Sumeet Gupta

198
1
13

Thanks for the answer, I know it but, Why in the LEFT JOIN case the filters `t1.c1 = t2.c1` and `t1.c1 = 2` are not being applied? If you analyze the output the first two rows don't match neither, `1 <> NULL` and `1<> 2` and `2 == 2` but `2 <> NULL` – Amanda Sep 01 '16 at 12:35
yes.. You are right. sorry my mistake I couldn't understand the question at first. I don't have access to hive right now, but your first query must return '2 A '2 B – Sumeet Gupta Sep 01 '16 at 12:52
maybe the issue is being caused because you are trying to do a static comparison (t1.c1 = 2) within join whereas this should correctly go in there WHERE section instead of JOIN ON. – Sumeet Gupta Sep 01 '16 at 12:55

score 0 · Answer 3 · answered Oct 09 '18 at 21:03

Hive apparently treats Join criteria differently in Inner Joins vs. Left Joins. In Inner Joins, you can put filter criteria into the ON clause, but in Left Joins, you need to put filter criteria for the primary table (t1 in this case) into a separate WHERE clause. If you try

`select distinct t1.c1, t2.c2
from
schema.table1 t1
left join
schema.table2 t2
on (t1.c2 = t2.c2 
    and t1.c1 = t2.c1)
where t1.c1 = 2;`

you should get the expected results.

Hive: LEFT JOIN vs JOIN gives different results with filter in ON clause

3 Answers3