0

I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query

SELECT 
    c_id, v_id, COUNT(DISTINCT(m_id)) AS participants, 
    cast(date_sub(current_date, ${window}) as string) as event_date 
from (
    select 
        a.c_id, a.v_id, a.user_id, 
        case 
            when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id 
        end as m_id 
    from (
        select * from first
        where event_date <= cast(date_sub(current_date, ${window}) as string)
    ) a 
    join (
        select * from second
    ) b on a.c_id = b.c_id 
    left join third c 
    on a.user_id = c.id1
    ) dx 
group by c_id, v_id;

I have changed the names but otherwise this is the select statement being used to insert overwrite to another table. Regarding the join

join (
        select * from second
    ) b on a.c_id = b.c_id 

b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing. Or I can just remove this join and it won't make any difference at all.

Thanks.

uptoNoGood
  • 566
  • 5
  • 20
  • A subquery at the top level of a FROM needs to have an alias whether you need to use it or not, that's the way the language is defined. That's all there is to it. PS Since you need the aliases for the ON, why are you even asking this? What exactly is your question? PS This can expected to be an easily found duplicate. But you need to clearly, concisely & fully phrase your question/goal/problem to google for it. – philipxy Jan 18 '22 at 23:30
  • @philipxy the question is not about the need of alias, but about the need of the join – uptoNoGood Jan 19 '22 at 05:51
  • Then the post isn't clear. Please read it & edit to be clear (including the title). Ask 1 (specific researched non-duplicate) question. (Not 1 in the title & 4 in the post.) Anyway whatever you're trying to ask, it will be an easily found duplicate. – philipxy Jan 19 '22 at 06:04
  • I have edited the post a little but title mentions purpose of a join already – uptoNoGood Jan 19 '22 at 06:05
  • That's no improvement. Read the post. What is the question? PS What does "serving any purpose" mean? It's there because the writer wanted the effect per how the language is defined. You aren't clearly describing what problem you see with clear justification per the manual. PS [Is there any rule of thumb to construct SQL query from a human-readable description?](https://stackoverflow.com/a/33952141/3404097) – philipxy Jan 19 '22 at 06:16

1 Answers1

2
  1. Join (any inner, left or right) can duplicate rows if join key in joined dataset is not unique. For example if a contains single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
  2. Join (inner) can filter rows if join key is absent in joined dataset. I believe this is what it meant to do.

If the goal is to get only rows with keys present in both datasets(filter) and you do not want duplication, and you do not use columns from joined dataset, then better use LEFT SEMI JOIN instead of JOIN, it will work as filter only even if there are duplicated keys in joined dataset:

left semi join (
                select c_id from second
               ) b on a.c_id = b.c_id 

This is much safer way to filter rows only which exist in both a and b and avoid unintended duplication.

You can replace join with WHERE IN/EXISTS, but it makes no difference, it is implemented as the same JOIN, check the EXPLAIN output and you will see the same query plan. Better use LEFT SEMI JOIN, it implements uncorrelated IN/EXISTS in efficient way.

If you prefer to move it to the WHERE:

WHERE a.c_id IN (select c_id from second)

or correlated EXISTS:

WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)

But as I said, all of them are implemented internally using JOIN operator.

leftjoin
  • 36,950
  • 8
  • 57
  • 116