I am getting an out-of-memory error when joining five large DataFrames:
TaskSetManager: Lost task 12.0 in stage 10.0 (TID 186, wn20-sal04s.sentience.local, executor 4): java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
My DataFrames are A, B, C, D, and E; they are joined on the ID column using full outer joins. Their sizes are:

A --> 20 GB
B --> 20 GB
C --> 10 GB
D --> 10 GB
E --> 10 GB
val joined = A
  .join(B, Seq("ID"), "outer")
  .join(C, Seq("ID"), "outer")
  .join(D, Seq("ID"), "outer")
  .join(E, Seq("ID"), "outer")
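For completeness, here is a self-contained sketch of the job. The Parquet inputs, the paths, and the final write are illustrative stand-ins; the join chain itself is exactly the one above:

import org.apache.spark.sql.{DataFrame, SparkSession}

object FiveWayJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("five-way-join").getOrCreate()

    // Illustrative inputs: each table is assumed to be Parquet with an ID column.
    val dfs: Seq[DataFrame] =
      Seq("A", "B", "C", "D", "E").map(name => spark.read.parquet(s"/data/$name"))

    // The same chain of full outer joins on ID, expressed as a reduce over the list.
    val joined = dfs.reduce((left, right) => left.join(right, Seq("ID"), "outer"))

    joined.write.parquet("/data/joined")
    spark.stop()
  }
}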
Below is my spark-submit command:
spark-submit \
  --master yarn \
  --driver-memory 16g \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=8g \
  --conf spark.yarn.driver.memoryOverhead=8g \
  --conf spark.default.parallelism=60 \
  --conf spark.sql.shuffle.partitions=60 \
  --num-executors 17 \
  --executor-cores 6 \
  --class x x.jar
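For context on the sizing: this requests 17 executors × (16 GB heap + 8 GB overhead) = 408 GB of executor memory and 17 × 6 = 102 cores in total, while spark.sql.shuffle.partitions=60 shuffles the roughly 70 GB of input into only 60 partitions, i.e. over 1 GB per partition on average, before any skew on ID.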
Could you please advise on how to perform this join with Spark 2.1 (Scala)?