I need some help understanding how Spark decides the number of partitions and how they are processed across executors. I know this is a common question, but even after reading many articles I still don't understand it, so I am describing a real-life use case I am currently working on, along with my spark-submit config and cluster config.
My hardware config:
3-node cluster with total vcores = 30 and total memory = 320 GB.
spark-submit config:
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.driver.memory=1000m \
--conf spark.speculation=true \
I am reading from a MySQL database using the Spark DataFrame JDBC API:
val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url"     -> jdcbUrl,
    "driver"  -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" -> s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t"
  )
).load
The total number of partitions created by the jdbcTable DataFrame is 200.
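For reference, this is how I checked the partition count (a minimal check against the DataFrame above, using the same sqlContext as in my job):

```scala
// Number of partitions of the DataFrame's underlying RDD;
// this is what reported 200 for me.
println(jdbcTable.rdd.getNumPartitions)

// The shuffle-partition setting (defaults to 200 unless overridden),
// which I suspect may be related to the number I'm seeing.
println(sqlContext.getConf("spark.sql.shuffle.partitions"))
```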
Questions:

1. How did Spark come up with 200 partitions? Is this a default setting?
2. As I have only 1 executor, are the 200 partitions processed in parallel in that single executor, or is one partition processed at a time?
3. Are executor-cores used to process the tasks for each partition with the configured concurrency, i.e. 2 in my case?
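For context, I am aware of the partitioned form of the JDBC read below, but I am not using it in my job; the bounds and partition count here are just placeholders:

```scala
// Partitioned JDBC read (NOT what I am currently running):
// Spark opens numPartitions parallel connections, splitting the range
// [lowerBound, upperBound] of partitionColumn into one query per partition.
val partitionedTable = sqlContext.read.format("jdbc").options(
  Map(
    "url"             -> jdcbUrl,
    "driver"          -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable"         -> "SOMETHING",
    "partitionColumn" -> "COLUMN",    // must be a numeric column
    "lowerBound"      -> "0",         // placeholder bound
    "upperBound"      -> "1000000",   // placeholder bound
    "numPartitions"   -> "10"         // placeholder partition count
  )
).load
```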