Using three or more joins within a single hive query

Question

Good afternoon, please tell me whether the output will be the same for these two situations? What are the alternatives if you need to use three or more different joins inside a single hive query.

from a
  join b on a.key=b.key
  join c on a.key=c.key
  left join u on a.key=u.key


_______
from a
  join b on a.key=b.key
  left join u on a.key=u.key
  join c on a.key=c.key

The output will be the same. Each join is assessed based on the reference columns, it doesn't really matter which order you put them in the query (at least not when the tables and reference columns are the same). I'm not sure what you mean by "what are the alternatives if you need to use three or more different joins" ...... you use whatever joins you need in your query to satisfy the output requirements — Craig, Mar 31 '21 at 09:49
Thanks. As for the second question, the fact is that hive gave an error during the second request and the documentation says that you can not use left/right/iner join together. | org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception from MapJoinOperator : null — Dark Morning, Mar 31 '21 at 10:09
@DarkMorning Please provide the link where did you find that in documentation. It is a bug. Hive does allow using left/right/iner joins together. In this docs you can find examples of inner and left together: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins — leftjoin, Mar 31 '21 at 10:59
Order of tables in join condition can improve performance of SQL. Its because hive can decide which table to use as a driver and pull it in memory. you can refer to this https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/optimize-joins.html#JoinOptimization-EnhancementsforStarJoins — Koushik Roy, Mar 31 '21 at 10:59

leftjoin · Answer 1 · 2021-03-31T11:37:50.280

The result should be the same but performance may be different. Joins can duplicate or reduce the number of rows.

All kind of joins can duplicate rows (except left semi join, which may or may not reduce the number of rows) if the join key is not unique. This is normal behavior and can be done intentionally. Each subsequent join after the duplicating one will process more rows and this will definitely affect performance.

Check that join key is unique in the table you are joining with to eliminate duplication.

Inner join can also reduce the number of rows (like left-semi-join) if keys are not matching. Subsequent joins will process less rows and this will improve performance. At the same time inner join can duplicate rows(see above).

Joins which reduce the dataset should be done before other joins if you want to improve performance. Filter in join ON condition if possible because it more efficient than WHERE clause, which is being applied after all joins. Joins duplicating rows(intentionally) should be executed last.

Normally CBO decides which joins should be done first based on statistics. Alternatively you can explicitly group joins into subqueries to make sure that joins reducing the dataset are being executed first. For example like this:

  from
  (--Do inner joins first to reduce the dataset before left join
   --The order of Inner joins also matters, most restrictive one 
   --should be performed first, you can add more subqueries to make the order explicit
  select a.key, ...
  from a
  join b on a.key=b.key
  join c on a.key=c.key
  )s --subquery with inner joins will be executed before left join
  left join u on s.key=u.key

Also Map joins can be combined and executed on single vertex. Read this answer.

score 0 · Accepted Answer · answered Mar 31 '21 at 11:37

Yes, they are equivalent, although the results may not be in the same order in the result set. And if you used select *, then the columns would be in a different order.

The reason is a little subtle -- the outer joined table is not used anywhere else in the FROM clause. So, you don't have to worry about NULL values from non-matching rows.

As a general rule, I order joins in the FROM clause starting with inner joins and followed by outer joins. The clause becomes quite difficult to accurately follow when you start mixing join types. So, I recommend:

from a join
     b
     on a.key = b.key join
     c
     on a.key = c.key left join
     u
     on a.key = u.key

Using three or more joins within a single hive query

2 Answers2