2

In spark-sql I have a query that uses several tables (both large & small) in Joins. My question is - does the order of these tables matter with respect to query performance ? For e.g.

select larger.col1, smaller.col2
from larger
Left Join smaller
on larger.key = smaller.key
Left Join larger2
on larger2.key2 = smaller.key2

I have searched online but didn't get a conclusive answer. So, will the performance vary if I change the order of the left vs right tables ?

marie20
  • 723
  • 11
  • 30
  • 1
    It can. But more importantly left and right joins are not commutative - so the order can actually affect the data. – Error_2646 Jul 31 '19 at 18:04
  • 3
    The *result set* will change if you change the order of the joins. Performance is a lesser consideration. – Gordon Linoff Jul 31 '19 at 18:06
  • see [here](https://stackoverflow.com/questions/16360860/does-sql-join-order-affect-performance) for performance and [here](https://stackoverflow.com/questions/9614922/does-the-join-order-matter-in-sql) for the effect on results – Blake Jul 31 '19 at 18:09
  • 1
    what about inner joins ? will there be difference in performance if i swap the order of larger and smaller in an inner join ? – marie20 Jul 31 '19 at 18:09
  • The hive docs looks to address all these points https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins – Error_2646 Jul 31 '19 at 18:20
  • I searched thru the web and found this video explaining hive join optimizations that uses the `STREAMTABLE(x)` and the `MapJoin(x)` hints. [ https://www.youtube.com/watch?v=dwd9m1Zl04Q ] . Can I consider this explanation to be correct w.r.t Spark-SQL as well ? What are your thoughts ? – marie20 Aug 04 '19 at 12:05
  • Just a simple observation here - only tangentially related. The JOIN between smaller and larger2 tables - because both are on the RHS of a LEFT JOIN - can lead to the proliferation of NULLs in intermediate result sets. – v0rl0n Aug 06 '20 at 11:29

1 Answers1

2

The join order seems to be changed for optimization by Spark.

There could be :

  • Reorder JOIN optimizer
  • Reorder JOIN optimizer - star schema
  • Reorder JOIN optimizer - cost based optimization

The following appears to shed some light on this topic:

https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-star-schema/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-cost-based-optimization/read

thebluephantom
  • 16,458
  • 8
  • 40
  • 83
marie20
  • 723
  • 11
  • 30