
I have around 80 GB of data. Everything runs smoothly until the last shuffle task: all other tasks finish within 30 minutes, but the last task takes more than 2 hours to complete.

Joins: (left join) I am joining 3 tables. One of the tables is relatively small (about 2 MB of data), so I broadcast it. Even after removing that third table, the issue was not resolved.
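For reference, this is roughly how the join is set up (a minimal PySpark sketch; the table names, the join key "key", and the broadcast of the small table are illustrative, not my exact code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("skew-example").getOrCreate()

# Hypothetical tables: big_a and big_b are large, small_c is the ~2 MB table
big_a = spark.table("big_a")
big_b = spark.table("big_b")
small_c = spark.table("small_c")

# Left-join the two large tables, then left-join the small one with a broadcast hint
joined = (
    big_a.join(big_b, on="key", how="left")
         .join(broadcast(small_c), on="key", how="left")
)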

Below are the parameters I have configured:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "904857600")
spark.conf.set("spark.cleaner.referenceTracking.blocking", "false")
spark.conf.set("spark.cleaner.periodicGC.interval", "5min")
spark.conf.set("spark.default.parallelism","6000")
spark.conf.set("spark.sql.shuffle.partitions","2000")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

1 Answer


You are suffering from data skew. Essentially, most of the work is being done by one node instead of being distributed across multiple nodes. This is why I asked for clarification on whether it was the whole job or a single task/stage.

You should consider adding a salt to your join key to help distribute the work across multiple nodes. It requires an extra shuffle, but it lessens the impact of one node doing all the work.

  1. Add a salt column to the join keys on each table in the join.

  2. Do your three-way join with the salt column included in the join keys.

  3. Then do a secondary group-by/aggregation to remove the salt from the result.

This will better distribute the work; a rough sketch is shown below.
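Here is a minimal PySpark sketch of one common variant of this salting approach, where the skewed (left) side gets a random salt and the other side is replicated across all salt values. The DataFrame names, the join key "key", and the salt range of 16 are all hypothetical; tune the salt range to your data and cluster:

from pyspark.sql import functions as F

SALT_BUCKETS = 16  # hypothetical; increase for heavier skew

# Salt the skewed (left) side with a random bucket per row
left_salted = left_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Replicate the other side so every salt value has a matching row
right_salted = right_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on the original key plus the salt column
joined = left_salted.join(right_salted, on=["key", "salt"], how="left")

# Drop the salt afterwards (or group by the original keys if the query aggregates)
result = joined.drop("salt")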

Matt Andruff