
I need some help understanding how Spark decides the number of partitions and how they are processed by executors. I apologize, as I know this is a repetitive question, but even after reading lots of articles I am still not able to understand it. Below is a real-life use case I am currently working on, along with my spark-submit configuration and cluster configuration.

My Hardware config:

3-node cluster with a total of 30 vcores and 320 GB of memory.

spark-submit config:

spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--num-executors 1  \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.driver.memory=1000m \
--conf spark.speculation=true

I am reading from a MySQL database using the Spark DataFrame JDBC API:

val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> jdcbUrl,
    "driver" -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" ->
      s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t"))
  .load()

The total number of partitions reported for the jdbcTable DataFrame is 200.
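For reference, the partition count can be confirmed at runtime like this (a sketch; `jdbcTable` is the DataFrame from the snippet above):

```scala
// Number of partitions of the DataFrame as read from JDBC.
// A plain JDBC read with no partitioning options yields 1 partition;
// a count of 200 usually appears only after a shuffle stage.
println(jdbcTable.rdd.getNumPartitions)
```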

Questions:

  1. How did Spark come up with 200 partitions? Is this a default setting?

  2. Since I have only 1 executor, are the 200 partitions processed in parallel in that single executor, or is one partition processed at a time?

  3. Are the executor cores used to process the tasks in each partition with the configured concurrency, i.e. 2 in my case?

desertnaut
nilesh1212

1 Answer

  • As it is written right now, Spark will use 1 partition only for the JDBC read.
  • If you see 200 partitions, it means that:

    • there is a subsequent shuffle (exchange) not shown in the code, and
    • you use the default value of spark.sql.shuffle.partitions, which is 200.
  • Parallelism depends on the execution plan and the assigned resources. It won't be higher than min(number of partitions, number of cores). With a single executor, it is limited by the number of threads assigned to that executor by the cluster manager (2 in your case, per --executor-cores).
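If the goal is to read from MySQL in parallel rather than through a single connection, the JDBC source can be told how to split the query with the partitionColumn, lowerBound, upperBound, and numPartitions options. A sketch under assumptions: COLUMN must be numeric (or a date/timestamp), currentUnixTime is a hypothetical upper bound you would supply, and 10 partitions is an illustrative choice:

```scala
val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> jdcbUrl,
    "driver" -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" ->
      s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t",
    // Split the read into 10 range-based partitions on a numeric column.
    // The bounds only decide the stride of each range; rows outside them
    // still land in the first or last partition.
    "partitionColumn" -> "COLUMN",
    "lowerBound" -> lastExtractUnixTime.toString,
    "upperBound" -> currentUnixTime.toString, // hypothetical upper bound
    "numPartitions" -> "10"))
  .load()
```

Note that even with a 10-partition read, your current submission (1 executor with 2 cores) would still process at most 2 tasks concurrently.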

Alper t. Turker