Optimize hive query to avoid JOIN

Question

Question is similar to this except I want to know if I can do it in one query. This is what I have working but as we all know joins are expensive. Any better hql to do this?

select a.tbl1,b.tbl2
from
(
  select count(*) as tbl1 from tbl1
) a
join
(
  select count(*) as tbl2 from tbl2
) b ON 1=1

Not that it has a 0 cost, but how expensive do you think a cross join between two records can be? — David דודו Markovitz, Feb 17 '17 at 07:44
@DuduMarkovitz I think there is some confusion between expense of calculation and overhead, I have adressed this in my answer. — Dennis Jaheruddin, Feb 17 '17 at 07:49

Alex · Answer 1 · 2017-02-20T15:45:21.043

2

If it is not critical for you to keep them as a separate columns you can use UNION ALL operation to work with row format:

select 'tbl1', count(*) from tbl1
UNION ALL
select 'tbl2', count(*) from tbl2;

This would allow you to avoid extra MAPJOIN operator in your former query. Technically you can have one less mapper in your end execution plan.

Update

In up-to-date distributions of Hadoop you will not get much differences from performance perspective of going either UNION or MAP JOIN approach as these operations would be optimized within former jobs. But keep in mind that on older versions of the cluster or basing on some configuration properties MAPJOIN could be converted into a separate job.

edited Feb 20 '17 at 15:45

answered Feb 17 '17 at 01:15

Alex

8,827
3
42
58

Later you may transpose the column to get into one row if required. – Ashish Singh Feb 17 '17 at 05:20
@AshishSingh He can use transpose but I'm not sure if it is necessary. Hive can translate this operation into another job, which would be an overhead for this particular purpose – Alex Feb 17 '17 at 09:29
Have you actually checked your assumption? It seems that `UNION ALL` involves a `MAP` job which means you are not avoiding extra job. – David דודו Markovitz Feb 18 '17 at 13:08
@DuduMarkovitz Yes, You can check the execution plan and you will see that at the end you will have only 2 MAP steps without an extra MAPJOIN step in former request. – Alex Feb 18 '17 at 13:17
I have checked and it not what I see. Please verify. – David דודו Markovitz Feb 18 '17 at 13:21
@DuduMarkovitz Reducer 2 <- Map 1 (SIMPLE_EDGE), Union 3 (CONTAINS) Reducer 5 <- Map 4 (SIMPLE_EDGE), Union 3 (CONTAINS) vs Reducer 2 <- Map 1 (SIMPLE_EDGE), Reducer 4 (BROADCAST_EDGE) Reducer 4 <- Map 3 (SIMPLE_EDGE) => (BROADCAST_EDGE) => Map Join Operator [MAPJOIN_114] – Alex Feb 18 '17 at 13:39
Are you running on TEZ by any chance? – David דודו Markovitz Feb 18 '17 at 19:43
@DuduMarkovitz No, I'm using 2.3 HDP instance with pure MR engine. But I guess the plan would make no difference here. – Alex Feb 18 '17 at 20:16
@DuduMarkovitz I guess I know what could cause the difference. This step could be either optimized in new versions of Hive or could be implemented as a separate job. It depends on the cluster version and configuration. See about MapJoin operation optimization here - https://cwiki.apache.org/confluence/display /Hive/LanguageManual+JoinOptimization – Alex Feb 18 '17 at 20:28
In the same configuration, does the OP query require an additiona mapl job? – David דודו Markovitz Feb 19 '17 at 06:14

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

Yes, Joins are expensive

When it is said that joins are expensive, this typically refers to the situation where you have many records in multiple tables that need to be matched with eachother.

According to that description your join is not expensive, as you only join 2 sets with 1 record each.

But, you must be looking at overhead

Perhaps you notice that the individual counts take significantly shorter than the command which you use to count and combine the result. This would be because map and reduce operations have significant overhead (can be 30 seconds per stage).

You can play around a bit to see whether you hit a plan that does not incur much overhead, but it could well be that you are out of luck as hive does not scale down that well.

Optimize hive query to avoid JOIN

2 Answers2

Yes, Joins are expensive

But, you must be looking at overhead