
I used a standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.

Now, I've read about partitions, but I couldn't figure out whether they are different from worker cores or not.

Is there a difference between setting the number of cores and the number of partitions?
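
To make the question concrete, this is roughly what I mean by the two settings (a minimal sketch; the master URL, file path, and numbers are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Submitted to the standalone master with, for example:
//   spark-submit --master spark://master:7077 --total-executor-cores 8 app.jar
// --total-executor-cores caps how many worker cores the application may use.

object CoresVsPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cores-vs-partitions"))

    // The partition count is a property of the data/RDD, not of the cluster:
    val rdd = sc.textFile("hdfs:///data/files/*")   // partitions follow the input splits
    println(rdd.getNumPartitions)

    val reshaped = rdd.repartition(100)             // explicitly ask for 100 partitions
    println(reshaped.getNumPartitions)              // 100

    sc.stop()
  }
}
```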

hasan.alkhatib
    Possible duplicate of [What are workers, executors, cores in Spark Standalone cluster?](http://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster) – arglee Nov 21 '16 at 20:56

2 Answers


A simplified view: partitions vs. number of cores

When you invoke an action on an RDD,

  • A "Job" is created for it. So, Job is a work submitted to spark.
  • Jobs are divided in to "STAGE" based on the shuffle boundary!!!
  • Each stage is further divided to tasks based on the number of partitions on the RDD. So Task is smallest unit of work for spark.
  • Now, how many of these tasks can be executed simultaneously depends on the "Number of Cores" available!!!
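
A minimal sketch of that breakdown, assuming a word-count job on an HDFS file (the path and the operations are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageTask {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-stage-task"))

    // The partition count of this RDD follows the input splits (~128 MB each).
    val pairs = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey needs a shuffle, so the job below runs in two stages:
    //   stage 1: read + flatMap + map  -> one task per input partition
    //   stage 2: the reduce side       -> one task per post-shuffle partition
    val counts = pairs.reduceByKey(_ + _)

    // collect() is the action: calling it submits one job to the cluster.
    // At any instant, at most <number of available cores> tasks run in parallel.
    counts.collect().take(10).foreach(println)

    sc.stop()
  }
}
```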
rakesh

A partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and split into 128 MB chunks (the Spark default), you get roughly 1,600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; up to about 1,600 cores (capped at the number of partitions/tasks) could work on this RDD in parallel.
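
As a rough sketch of that arithmetic (the file size and the core count here are assumptions, not output from a real cluster):

```scala
// Back-of-the-envelope arithmetic only; no Spark cluster needed to run this.
object PartitionMath {
  def main(args: Array[String]): Unit = {
    val fileSizeMB  = 200L * 1024            // a ~200 GB HDFS file
    val splitSizeMB = 128L                   // default block / split size
    val partitions  = math.ceil(fileSizeMB.toDouble / splitSizeMB).toInt
    println(s"partitions (= tasks per stage): $partitions")              // 1600

    val cores = 50                           // cores granted to the application
    println(s"tasks running at once: ${math.min(cores, partitions)}")    // 50
    println(s"waves needed: ${math.ceil(partitions.toDouble / cores).toInt}") // 32
  }
}
```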

Tim
    So if I have 70 cores and my input parquet has 1000 partitions, that means I can only process 70 tasks at a time, correct? In which case the remaining 930 tasks just sit waiting. However, if I repartition my dataset to just 70 or 100 partitions, I might end up with a very large file size per partition. – thentangler Feb 01 '21 at 01:04
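
A quick sketch of the arithmetic behind that comment (all numbers, including the assumed 500 GB dataset size, are illustrative): tasks beyond the core count are not wasted, they queue up and run in later waves, while repartitioning down to the core count makes each partition proportionally larger.

```scala
// Plain Scala arithmetic; no cluster required.
object WavesVsPartitionSize {
  def main(args: Array[String]): Unit = {
    val cores      = 70
    val partitions = 1000
    val dataGB     = 500.0                   // assumed total dataset size

    // The 1000 tasks run in ceil(1000 / 70) waves across the 70 cores:
    println(s"waves: ${math.ceil(partitions.toDouble / cores).toInt}")          // 15

    // Repartitioning down to the core count makes each partition larger:
    println(f"GB per partition at 1000 partitions: ${dataGB / partitions}%.2f") // 0.50
    println(f"GB per partition at   70 partitions: ${dataGB / cores}%.2f")      // 7.14
  }
}
```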