I am getting an out-of-memory error when joining five large DataFrames.

TaskSetManager: Lost task 12.0 in stage 10.0 (TID 186, wn20-sal04s.sentience.local, executor 4): java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)

My DataFrames are A, B, C, D, and E; they are joined on the ID column with a full outer join. Their sizes are:

A --> 20 GB
B --> 20 GB
C --> 10 GB
D --> 10 GB
E --> 10 GB

A.join(B, Seq("ID"), "outer")
  .join(C, Seq("ID"), "outer")
  .join(D, Seq("ID"), "outer")
  .join(E, Seq("ID"), "outer")
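
For reference, here is a self-contained sketch of the whole job as I run it. The app name, the parquet paths, and the output write are placeholders standing in for my actual tables, not the real values:

import org.apache.spark.sql.SparkSession

object FiveWayJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FiveWayOuterJoin").getOrCreate()

    // Placeholder input paths; each table has an ID column plus its own payload columns.
    val a = spark.read.parquet("/data/A")
    val b = spark.read.parquet("/data/B")
    val c = spark.read.parquet("/data/C")
    val d = spark.read.parquet("/data/D")
    val e = spark.read.parquet("/data/E")

    // The full outer join chain from above: every join keys on ID.
    val joined = a.join(b, Seq("ID"), "outer")
      .join(c, Seq("ID"), "outer")
      .join(d, Seq("ID"), "outer")
      .join(e, Seq("ID"), "outer")

    joined.write.parquet("/data/joined") // placeholder output path
    spark.stop()
  }
}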

Below is my spark-submit command:

spark-submit --master yarn \
  --driver-memory 16g \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=8g \
  --conf spark.yarn.driver.memoryOverhead=8g \
  --conf spark.default.parallelism=60 \
  --conf spark.sql.shuffle.partitions=60 \
  --num-executors 17 \
  --executor-cores 6 \
  --class x x.jar
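
For completeness, here are the same settings expressed programmatically. This is only a sketch of the configuration, not how I actually launch the job; the master and driver memory must still come from spark-submit, since the driver JVM is already running by the time this code executes:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Same values as the spark-submit call above.
// Total cluster footprint: 17 executors x (16g heap + 8g overhead) = 408 GB,
// plus 16g + 8g for the driver.
val conf = new SparkConf()
  .set("spark.executor.memory", "16g")
  .set("spark.yarn.executor.memoryOverhead", "8g")
  .set("spark.yarn.driver.memoryOverhead", "8g")
  .set("spark.default.parallelism", "60")
  .set("spark.sql.shuffle.partitions", "60")
  .set("spark.executor.instances", "17")
  .set("spark.executor.cores", "6")
// spark.driver.memory is omitted here: it must be set before the driver JVM
// starts, so it only takes effect on the spark-submit command line.

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()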

Could you please advise on how to perform this DataFrame join using Spark 2.1 (Scala)?

