
I need some help understanding how Spark decides the number of partitions and how they are processed by executors. I apologize, as I know this is a repetitive question, but even after reading lots of articles I am still not able to understand it. Below is a real-life use case I am currently working on, along with my spark-submit configuration and cluster configuration.

My Hardware config:

3-node cluster with a total of 30 vcores and 320 GB of memory.

spark-submit config:

spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--num-executors 1  \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.driver.memory=1000m \
--conf spark.speculation=true

I am reading from a MySQL database using the Spark DataFrame JDBC API:

val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> jdcbUrl,
    "driver" -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" ->
      s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t"))
  .load()

The total number of partitions reported for the jdbcTable DataFrame is 200.
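For reference, the partition count can be confirmed at runtime like this (a sketch; `jdbcTable` is the DataFrame from the snippet above):

```scala
// Number of partitions of the DataFrame as read from JDBC.
// A plain JDBC read with no partitioning options yields 1 partition;
// a count of 200 usually appears only after a shuffle stage.
println(jdbcTable.rdd.getNumPartitions)
```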

Questions:

  1. How did Spark come up with 200 partitions? Is this a default setting?

  2. Since I have only 1 executor, are the 200 partitions processed in parallel in that single executor, or is one partition processed at a time?

  3. Are the executor cores used to process the tasks in each partition with the configured concurrency, i.e. 2 in my case?

desertnaut
nilesh1212

1 Answer

  • As it is written right now, Spark will use 1 partition only for the JDBC read.
  • If you see 200 partitions, it means that:

    • there is a subsequent shuffle (exchange) not shown in the code, and
    • you use the default value of spark.sql.shuffle.partitions, which is 200.
  • Parallelism depends on the execution plan and the assigned resources. It won't be higher than min(number of partitions, number of cores). With a single executor, it is limited by the number of threads assigned to that executor by the cluster manager (2 in your case, per --executor-cores).
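If the goal is to read from MySQL in parallel rather than through a single connection, the JDBC source can be told how to split the query with the partitionColumn, lowerBound, upperBound, and numPartitions options. A sketch under assumptions: COLUMN must be numeric (or a date/timestamp), currentUnixTime is a hypothetical upper bound you would supply, and 10 partitions is an illustrative choice:

```scala
val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> jdcbUrl,
    "driver" -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" ->
      s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t",
    // Split the read into 10 range-based partitions on a numeric column.
    // The bounds only decide the stride of each range; rows outside them
    // still land in the first or last partition.
    "partitionColumn" -> "COLUMN",
    "lowerBound" -> lastExtractUnixTime.toString,
    "upperBound" -> currentUnixTime.toString, // hypothetical upper bound
    "numPartitions" -> "10"))
  .load()
```

Note that even with a 10-partition read, your current submission (1 executor with 2 cores) would still process at most 2 tasks concurrently.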

Alper t. Turker