We have a Spark job that is taking a long time to complete. I looked at the Spark web UI and I see a lot of shuffling. I have tried a couple of things, but no luck so far: I increased spark.sql.shuffle.partitions (tried 320, 640 and 1600), the number of executors (8), memory (10 and 12 GB), and cores (4), with no significant improvement. I would appreciate any guidance on the following:
1) When I look at the event timeline in the Spark web UI, only one executor is doing most of the processing; I don't see any significant activity on the rest.
Any pointers on how to investigate further would be a great help. Basically, I am looking for documentation on the event timeline, since I see a single executor performing the bulk of the work, and on how to use the metrics to fix the performance issue by adjusting the Spark configuration parameters, if that is an option.
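One possible way to check whether the single busy executor comes from uneven partition sizes is to compare the largest partition against a typical one. This is only a minimal sketch under assumptions: the helper name skew_ratio is hypothetical, the counts in the example are made up for illustration, and it assumes the per-partition row counts have already been collected (in PySpark that could be done, for instance, by grouping on pyspark.sql.functions.spark_partition_id() and counting).

```python
from statistics import median


def skew_ratio(rows_per_partition):
    """Return max / median of the non-empty per-partition row counts.

    A value close to 1.0 means the rows are spread evenly; a large value
    means one partition (and so one task) holds far more rows than a
    typical one. Empty partitions are ignored so they do not drag the
    median down to zero.
    """
    counts = [c for c in rows_per_partition if c > 0]
    if not counts:
        # Nothing to compare, so report "no skew".
        return 1.0
    return max(counts) / median(counts)


if __name__ == "__main__":
    # Hypothetical counts, made up for illustration only.
    # In PySpark they might come from something like:
    #   df.groupBy(spark_partition_id()).count()
    example_counts = [1000, 1200, 900, 1100, 250000]
    print(f"skew ratio: {skew_ratio(example_counts):.1f}")
```

If the ratio is large, adding more shuffle partitions or executors would probably not help much, because the rows for one key still end up in the same partition.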