
I have a Spark cluster with 8 machines, 256 cores, and 180 GB of RAM per machine. I have started 32 executors, each with 32 cores and 40 GB of RAM.

I am trying to optimize a complex application and I notice that many of the stages have 200 tasks. This seems sub-optimal in my case. I have tried setting spark.default.parallelism to 1024, but it appears to have no effect.

I run Spark 2.0.1 in standalone mode; my driver is hosted on a workstation, running inside a PyCharm debug session. I have set spark.default.parallelism in:

  • spark-defaults.conf on the workstation
  • spark-defaults.conf in the spark/conf directory on the cluster
  • the call that builds the SparkSession on my driver

This is that call:

spark = SparkSession \
    .builder \
    .master("spark://stcpgrnlp06p.options-it.com:7087") \
    .appName(__SPARK_APP_NAME__) \
    .config("spark.default.parallelism",numOfCores) \
    .getOrCreate()
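
As a sanity check, I have been printing what the driver actually sees once the session is up (a rough sketch; spark is the session built above):

# What Spark reports as the default parallelism for RDD operations
print(spark.sparkContext.defaultParallelism)

# The raw config value, or None if it was never picked up
print(spark.sparkContext.getConf().get("spark.default.parallelism"))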

I have restarted the executors since making these changes.

If I understood this correctly, having only 200 tasks in a stage means that my cluster is not being fully utilized?

When I watch the machines using htop I can see that I'm not getting full CPU usage; maybe on one machine at a time, but not on all of them.

Do I need to call .rdd.repartition(1024) on my dataframes? That seems like a burden to do everywhere.
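
To be concrete, this is the kind of call I mean (just a sketch; df stands for any one of my DataFrames):

# DataFrame.repartition works directly, without going through .rdd
df = df.repartition(1024)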

ThatDataGuy
  • Try setting this configuration: set("spark.sql.shuffle.partitions", "8"), where 8 is the number of partitions that you want to make – Shivansh Nov 22 '16 at 16:19
  • Possible duplicate of [Number reduce tasks Spark](http://stackoverflow.com/questions/33297689/number-reduce-tasks-spark) – sgvd Nov 22 '16 at 16:50
  • But why do you only want to use 8? As far as I know it should be equal to or higher than the number of tasks that run at the same time. – Simon Schiff Nov 22 '16 at 21:17
  • So, for anyone else who finds this: tweaking the number of cores per executor to 8 and spark.sql.shuffle.partitions=256 gave the best performance in my case. – ThatDataGuy Nov 23 '16 at 09:07

2 Answers


Try setting this configuration: set("spark.sql.shuffle.partitions", "8")

where 8 is the number of partitions that you want to make.
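
For example, in PySpark this could look roughly like the following (a sketch only; the builder pattern mirrors the one in your question, master and app name are omitted, and 8 is just the example value from above):

from pyspark.sql import SparkSession

# Set the number of shuffle partitions when building the session
spark = SparkSession \
    .builder \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# Or change it at runtime, before running the job that needs it
spark.conf.set("spark.sql.shuffle.partitions", "8")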

Shivansh

Or set it on the SparkSession builder:

.config("spark.sql.shuffle.partitions", "2")

Andy