Order of the tables in a JOIN

Question

In spark-sql I have a query that uses several tables (both large & small) in Joins. My question is - does the order of these tables matter with respect to query performance ? For e.g.

select larger.col1, smaller.col2
from larger
Left Join smaller
on larger.key = smaller.key
Left Join larger2
on larger2.key2 = smaller.key2

I have searched online but didn't get a conclusive answer. So, will the performance vary if I change the order of the left vs right tables ?

It can. But more importantly left and right joins are not commutative - so the order can actually affect the data. — Error_2646, Jul 31 '19 at 18:04
The *result set* will change if you change the order of the joins. Performance is a lesser consideration. — Gordon Linoff, Jul 31 '19 at 18:06
see [here](https://stackoverflow.com/questions/16360860/does-sql-join-order-affect-performance) for performance and [here](https://stackoverflow.com/questions/9614922/does-the-join-order-matter-in-sql) for the effect on results — Blake, Jul 31 '19 at 18:09
what about inner joins ? will there be difference in performance if i swap the order of larger and smaller in an inner join ? — marie20, Jul 31 '19 at 18:09
The hive docs looks to address all these points https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins — Error_2646, Jul 31 '19 at 18:20
I searched thru the web and found this video explaining hive join optimizations that uses the `STREAMTABLE(x)` and the `MapJoin(x)` hints. [ https://www.youtube.com/watch?v=dwd9m1Zl04Q ] . Can I consider this explanation to be correct w.r.t Spark-SQL as well ? What are your thoughts ? — marie20, Aug 04 '19 at 12:05
Just a simple observation here - only tangentially related. The JOIN between smaller and larger2 tables - because both are on the RHS of a LEFT JOIN - can lead to the proliferation of NULLs in intermediate result sets. — v0rl0n, Aug 06 '20 at 11:29

score 2 · Accepted Answer · edited Aug 06 '20 at 07:37

2

The join order seems to be changed for optimization by Spark.

There could be :

Reorder JOIN optimizer
Reorder JOIN optimizer - star schema
Reorder JOIN optimizer - cost based optimization

The following appears to shed some light on this topic:

https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-star-schema/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer/read https://www.waitingforcode.com/apache-spark-sql/reorder-join-optimizer-cost-based-optimization/read

edited Aug 06 '20 at 07:37

thebluephantom

16,458
8
40
83

answered Aug 06 '20 at 06:34

marie20

723
11
30

what about external sources? – thebluephantom Aug 06 '20 at 07:10

Order of the tables in a JOIN

1 Answers1