
This might be a very generic question, but I hope someone can point me to a hint. I found that my Spark job sometimes seems to hit a "pause" many times:

The nature of the job is: read ORC files (from a Hive table), filter by certain columns, no joins, then write out to another Hive table.

There were a total of 64K tasks for my job / stage (FileScan orc, followed by Filter and Project).
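For context, here is a minimal sketch of what the job does, run from spark-shell. The table names, column names, and write mode below are placeholders, not the real ones:

    // spark and the $ column syntax are already available in spark-shell
    import spark.implicits._

    val src = spark.table("source_db.events")                        // FileScan orc
    val filtered = src.filter($"event_date" === "2020-05-01" &&
                              $"status" === "OK")                    // Filter
    val projected = filtered.select("id", "event_date", "payload")   // Project
    projected.write.mode("append").saveAsTable("target_db.events_filtered")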

The application has 500 executors, each with 4 cores. Initially, about 2000 tasks were running concurrently and things looked good.

After a while, I noticed the number of running tasks dropped all the way down to around 100. Many cores/executors were just waiting with nothing to do. (I checked the logs from these waiting executors; there were no errors. All tasks assigned to them had finished, and they were simply waiting.)

After about 3-5 minutes, these waiting executors suddenly got tasks assigned and were working happily again.

Any particular reason this can happen? The application is run from spark-shell (--master yarn --deploy-mode client, with the number of executors/sizes etc. specified), roughly as shown below.
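The launch command looks roughly like this (the executor memory value here is just an example, not the actual setting):

    spark-shell --master yarn --deploy-mode client \
      --num-executors 500 \
      --executor-cores 4 \
      --executor-memory 8g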

Thanks!

kcode2019
  • Share screenshots of your job from the Spark web UI while it's running, or from the Spark history server for previous runs. – Aravind Yarram May 05 '20 at 02:36
  • I had a similar problem. I could improve the situation by upgrading the Spark version from 2.3.2 to 2.4.4. – werner May 24 '20 at 17:12

0 Answers