
I am running a rather simple Spark job: read a couple of Parquet datasets (10-100 GB each), do a bunch of joins, and write the result back to Parquet.
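For context, the job is roughly of this shape. This is only a hedged sketch, not the actual code; the paths, dataset names, and join key (`user_id`) are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object JoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-joins").getOrCreate()

    // Two of the input datasets, each roughly 10-100 GB of Parquet (placeholder paths)
    val events = spark.read.parquet("hdfs:///data/events")
    val users  = spark.read.parquet("hdfs:///data/users")

    // One of several equi-joins in the job (placeholder key)
    val joined = events.join(users, Seq("user_id"))

    // Final stage: write the result back out as Parquet
    joined.write.mode("overwrite").parquet("hdfs:///data/output")

    spark.stop()
  }
}
```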

Spark always seems to get stuck on the last stage. The stage stays "Pending" even though all previous stages have completed and there are executors waiting. I've waited up to 1.5 hours and it just stays stuck.

I have tried the following desperate measures:

  • Using smaller datasets appears to work, but then the plan changes (e.g., some broadcast joins start to pop up), so that doesn't really help with troubleshooting (see the config sketch after this list).
  • Allocating more executor or driver memory doesn't seem to help.
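One way to keep the physical plan of a down-sampled run comparable to the full-size run is to disable automatic broadcast joins via `spark.sql.autoBroadcastJoinThreshold`. This is a sketch under that assumption, not something taken from the job above; executor and driver memory, by contrast, generally have to be set at submit time (`--executor-memory` / `--driver-memory`), since those JVMs are already running by the time application code executes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-joins-debug").getOrCreate()

// Runtime SQL config: "-1" disables broadcast-hash joins entirely, so the plan
// for a down-sampled dataset stays closer to the full-size run. Any positive
// value is interpreted as a size threshold in bytes.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```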

Any idea?


  • You should provide the relevant part of the code, otherwise it's guesswork. My bet is that you have an expensive join or some other transformation somewhere, and since Spark Datasets are calculated lazily, it shows up at the last stage. Try isolating parts of your code to track down the bottleneck. – steven35 Oct 08 '18 at 14:46
  • I do have a rather large join. However, are there valid cases where the stage would stay pending and not start at all? Does the amount of data matter? I am trying to ascertain if this could be something else than a bug in Spark. – pay Oct 08 '18 at 14:59
  • I recently had a scenario where I did a left outer join on two columns which produced exactly the same "hanging" you are experiencing. Refactoring the single join into two separate consecutive joins fixed the issue. The amount of data wasn't particularly large (<8 GB), so some operations just seem to be overly inefficient. To find a bottleneck, I always go step by step, commenting out the rest of the code until I have found the issue, calling `count()` on the DF so the computation is carried out. – steven35 Oct 08 '18 at 15:10
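For reference, a minimal sketch of the step-by-step isolation described in the comment above. The input paths, `df1`/`df2`/`df3`, and the join keys are hypothetical stand-ins for the real DataFrames, not the code from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("isolate-bottleneck").getOrCreate()

// Placeholder inputs, not the real datasets
val df1 = spark.read.parquet("hdfs:///data/events")
val df2 = spark.read.parquet("hdfs:///data/users")
val df3 = spark.read.parquet("hdfs:///data/sessions")

// Comment out later steps and materialise each intermediate result with count()
// so the expensive step shows up as its own job in the Spark UI.
val step1 = df1.join(df2, Seq("user_id"))
println(s"after first join: ${step1.count()} rows")

val step2 = step1.join(df3, Seq("session_id"))
println(s"after second join: ${step2.count()} rows")

// Whichever count() hangs (or takes disproportionately long) points at the
// transformation to refactor, e.g. splitting one multi-column outer join
// into two consecutive joins as suggested above.
```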

0 Answers