I am running a rather simple Spark job: read a couple of Parquet datasets (10-100 GB each), do a bunch of joins, and write the result back to Parquet.
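Roughly, the job has this shape (the paths, dataset names, and join key below are placeholders, not my real ones):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-and-write")
  .getOrCreate()

// Read a couple of large Parquet datasets (10-100 GB each); paths are placeholders.
val a = spark.read.parquet("s3://my-bucket/dataset_a/")
val b = spark.read.parquet("s3://my-bucket/dataset_b/")
val c = spark.read.parquet("s3://my-bucket/dataset_c/")

// A bunch of joins on a shared key ("id" stands in for the real keys).
val joined = a
  .join(b, Seq("id"))
  .join(c, Seq("id"))

// Write the result back to Parquet; this is the stage that never starts.
joined.write.mode("overwrite").parquet("s3://my-bucket/output/")
```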
Spark always seems to get stuck on the last stage: it stays "Pending" even though all previous stages have completed and there are executors sitting idle. I've waited up to 1.5 hours and it just stays stuck.
I have tried the following desperate measures:
- Using smaller datasets appears to work, but then the plan changes (e.g., some broadcast joins start to pop up), so it doesn't really help with troubleshooting (one way to pin the plan down is sketched after this list).
- Allocating more executor or driver memory doesn't seem to help.
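To keep the smaller-data plan comparable to the full-size one, automatic broadcast joins could be disabled. A minimal sketch (I haven't verified whether this makes the hang reproducible on small data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// -1 disables automatic broadcast joins (the Spark 2.3 default threshold
// is 10 MB), so small test runs keep the same sort-merge join plan as the
// full-size datasets instead of switching to broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```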
Any ideas?
Details
- Running Spark 2.3.1 on Amazon EMR (5.17), client mode on YARN
- Driver thread dump
- Appears similar to "Spark job showing unknown in active stages and stuck", although I can't be sure
- Job details showing the stage stuck in "Pending":