I am getting an out-of-memory error when joining five large DataFrames:
TaskSetManager: Lost task 12.0 in stage 10.0 (TID 186, wn20-sal04s.sentience.local, executor 4): java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
My DataFrames are A, B, C, D, and E; they are joined on the ID column using full outer joins. Their sizes are:

A --> 20 GB
B --> 20 GB
C --> 10 GB
D --> 10 GB
E --> 10 GB
val joined = A
  .join(B, Seq("ID"), "outer")
  .join(C, Seq("ID"), "outer")
  .join(D, Seq("ID"), "outer")
  .join(E, Seq("ID"), "outer")
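For completeness, here is a self-contained sketch of the job. The Parquet inputs, the paths, and the final write are illustrative stand-ins; the join chain itself is exactly the one above:

import org.apache.spark.sql.{DataFrame, SparkSession}

object FiveWayJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("five-way-join").getOrCreate()

    // Illustrative inputs: each table is assumed to be Parquet with an ID column.
    val dfs: Seq[DataFrame] =
      Seq("A", "B", "C", "D", "E").map(name => spark.read.parquet(s"/data/$name"))

    // The same chain of full outer joins on ID, expressed as a reduce over the list.
    val joined = dfs.reduce((left, right) => left.join(right, Seq("ID"), "outer"))

    joined.write.parquet("/data/joined")
    spark.stop()
  }
}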
Below is my spark-submit command:
spark-submit \
  --master yarn \
  --driver-memory 16g \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=8g \
  --conf spark.yarn.driver.memoryOverhead=8g \
  --conf spark.default.parallelism=60 \
  --conf spark.sql.shuffle.partitions=60 \
  --num-executors 17 \
  --executor-cores 6 \
  --class x x.jar
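For context on the sizing: this requests 17 executors × (16 GB heap + 8 GB overhead) = 408 GB of executor memory and 17 × 6 = 102 cores in total, while spark.sql.shuffle.partitions=60 shuffles the roughly 70 GB of input into only 60 partitions, i.e. over 1 GB per partition on average, before any skew on ID.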
Could you please advise on how to perform this join with Spark 2.1 (Scala)?