
I have a Spark cluster with 8 machines, 256 cores, and 180 GB of RAM per machine. I have started 32 executors, each with 32 cores and 40 GB of RAM.

I am trying to optimize a complex application and I notice that many of the stages have 200 tasks. This seems sub-optimal in my case. I have tried setting spark.default.parallelism to 1024, but it appears to have no effect.

I run Spark 2.0.1 in standalone mode; my driver is hosted on a workstation, running inside a PyCharm debug session. I have set spark.default.parallelism in:

  • spark-defaults.conf on the workstation
  • spark-defaults.conf in the spark/conf directory on the cluster
  • the call that builds the SparkSession on my driver

This is that call:

spark = SparkSession \
    .builder \
    .master("spark://stcpgrnlp06p.options-it.com:7087") \
    .appName(__SPARK_APP_NAME__) \
    .config("spark.default.parallelism",numOfCores) \
    .getOrCreate()
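
As a sanity check, I have been printing what the driver actually sees once the session is up (a rough sketch; spark is the session built above):

# What Spark reports as the default parallelism for RDD operations
print(spark.sparkContext.defaultParallelism)

# The raw config value, or None if it was never picked up
print(spark.sparkContext.getConf().get("spark.default.parallelism"))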

I have restarted the executors since making these changes.

If I understood this correctly, having only 200 tasks in a stage means that my cluster is not being fully utilized?

When I watch the machines using htop I can see that I'm not getting full CPU usage; maybe on one machine at a time, but not on all of them.

Do I need to call .rdd.repartition(1024) on my dataframes? That seems like a burden to do everywhere.
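
To be concrete, this is the kind of call I mean (just a sketch; df stands for any one of my DataFrames):

# DataFrame.repartition works directly, without going through .rdd
df = df.repartition(1024)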

ThatDataGuy
  • Try setting this configuration: set("spark.sql.shuffle.partitions", "8"), where 8 is the number of partitions that you want to make – Shivansh Nov 22 '16 at 16:19
  • Possible duplicate of [Number reduce tasks Spark](http://stackoverflow.com/questions/33297689/number-reduce-tasks-spark) – sgvd Nov 22 '16 at 16:50
  • But why do you only want to use 8? As far as I know it should be equal to or higher than the number of tasks that run at the same time. – Simon Schiff Nov 22 '16 at 21:17
  • So, for anyone else who finds this: tweaking the number of cores per executor to 8 and spark.sql.shuffle.partitions=256 gave the best performance in my case. – ThatDataGuy Nov 23 '16 at 09:07

2 Answers


Try setting this configuration: set("spark.sql.shuffle.partitions", "8")

where 8 is the number of partitions that you want to make.
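
For example, in PySpark this could look roughly like the following (a sketch only; the builder pattern mirrors the one in your question, master and app name are omitted, and 8 is just the example value from above):

from pyspark.sql import SparkSession

# Set the number of shuffle partitions when building the session
spark = SparkSession \
    .builder \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# Or change it at runtime, before running the job that needs it
spark.conf.set("spark.sql.shuffle.partitions", "8")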

Shivansh

Or set it on the SparkSession builder:

.config("spark.sql.shuffle.partitions", "2")

Andy