1

Can some explain what is the difference between these 2 queries in HiveQL. Basically I wanted to filter out the dates to be greater than 2017-05-01 from table a. The second query didn't give the result i expected but the first one did. I thought they were equivalent

select a.user_id
, b.user_id
, a.event_date
, b.event_date
, to_date(a.event_date)
from default.t1 as a
left join stage.t2 as b
on a.user_id = b.user_id
and a.event_date = b.event_date
where a.event_date >= '2017-05-01'

vs.

select a.user_id
, b.user_id
, a.event_date
, b.event_date
, to_date(a.event_date)
from default.t1 as a
left join stage.t2 as b
on a.user_id = b.user_id
and a.event_date = b.event_date
and a.event_date >= '2017-05-01'`
lollerskates
  • 964
  • 1
  • 11
  • 28
  • You are using a `left join`, so in the first query you are getting results where `b.event_date` could be null and then filter the final result....in the second you put the date condition in the join, so you ensure to get the right data first....that's what I understand, but, someone could get a better explanation – Hackerman Aug 29 '17 at 17:16

2 Answers2

3

A left join keeps all rows in the first table, regardless of whether or not the on clause evaluates to true. When the on clause evaluates to true, then it also has all matching rows in the second table. Otherwise, the columns from the second table are NULL.

As a consequence, a left join basically ignores any filtering conditions on the first table. It is going to keep the rows from the first table regardless.

Actually, the situation is slightly more complicated. If the on condition does not evaluate to true, then the columns from the second table are all NULL. So, in the second query, when the dates don't match, the columns from b should be NULL.

What to do?

  • Conditions on the first table in a left join should always go in the where clause.
  • Conditions on the second table should always go in the on clause.
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • thanks for the tip on when to use filter in the `on` and `where` clause. I've always had trouble wrapping my head around that (can't visualize it in my head) – lollerskates Aug 31 '17 at 21:06
1

I find similar question, the answer by @Sandeep Jindal is the best

SQL join: where clause vs. on clause

a. WHERE clause: After joining. Records will be filtered after join has taken place.

b. ON clause - Before joining. Records (from right table) will be filtered before joining. This may end up as null in the result (since OUTER join).

buqing
  • 925
  • 8
  • 25