I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names but otherwise this is the select statement being used to insert overwrite to another table. Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing.
Or I can just remove this join and it won't make any difference at all.
Thanks.