I have large DataFrames: A (200 GB), B (20 MB), C (15 MB), D (10 MB), E (12 MB). I want to join them with Spark SQL in the same SparkSession: A join B, and C join D join E. Like this:
val absql = spark.sql("select * from A a inner join B b on a.id = b.id")
absql.write.csv("/path/for/ab")
val cdesql = spark.sql("select * from C c inner join D d on c.id = d.id inner join E e on c.id = e.id")
cdesql.write.csv("/path/for/cde")
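For reference, this is how the threshold is changed between the runs described below (a sketch, assuming Spark 2.x; the value is in bytes, so 20 MB is 20971520):

```scala
// raise the broadcast threshold to 20 MB for the current session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20971520)
// or disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```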
Problem:
With the default spark.sql.autoBroadcastJoinThreshold=10m:
- absql takes a long time, because the join key in A is skewed;
- cdesql runs normally.
With spark.sql.autoBroadcastJoinThreshold=20m:
- C, D, and E are broadcast, and all of the tasks run in the same executor, so it still takes a long time;
- if I set num-executors=200, the broadcast itself takes a long time;
- absql runs normally.
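What I really want is to broadcast B in the A-B join without broadcasting C, D, and E. One per-query alternative I am considering (a sketch, assuming Spark 2.2+, where SQL broadcast hints are supported) is to keep the default 10 MB threshold and hint only the A-B join:

```scala
// threshold stays at the default (10 MB), so C/D/E are shuffle-joined;
// the hint forces B (20 MB) to be broadcast anyway, avoiding the skewed shuffle of A
spark.sql("select /*+ BROADCAST(b) */ * from A a inner join B b on a.id = b.id")
  .write.csv("/path/for/ab")

// equivalent DataFrame-API form, using functions.broadcast
import org.apache.spark.sql.functions.broadcast
val a = spark.table("A")
val b = spark.table("B")
a.join(broadcast(b), a("id") === b("id")).write.csv("/path/for/ab")
```

Is this the right way to get both behaviors in one session, or is there a better approach?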