Suppose two tables:
table1.c1 table1.c2
1 1 A
2 1 B
3 1 C
4 2 A
5 2 B
and
table2.c1 table2.c2
1 2 A
2 2 D
3 3 A
4 3 B
When I do:
select distinct t1.c1, t2.c2
from
schema.table1 t1
join
schema.table2 t2
on (t1.c2 = t2.c2
and t1.c1 = t2.c1
and t1.c1 = 2)
in Hive, I get:
t1.c1 t2.c2
1 2 A
This is the expected result, no problem. But, when I do:
select distinct t1.c1, t2.c2
from
schema.table1 t1
left join
schema.table2 t2
on (t1.c2 = t2.c2
and t1.c1 = t2.c1
and t1.c1 = 2)
I get:
t1.c1 t2.c2
1 1 NULL
2 2 NULL
3 2 A
So, filter in ON clause seems not to work like I had expected. The filters t1.c1 = t2.c1
and t1.c1 = 2
hasn't been applied when, in the LEFT JOIN, it doesn't find the key on the second table so t2.c2
is NULL
.
I suppose that the answer must be in doc (May be in the 'Joins occur BEFORE WHERE CLAUSES' section?) But still I don't understand the difference.
How is the process to give different results?