
I used a standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.

Now, I've read about partitions, but I couldn't figure out whether they are different from worker cores or not.

Is there a difference between setting the number of cores and the number of partitions?
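
To make the question concrete, this is roughly what I mean by the two settings (a minimal sketch; the master URL, file path, and numbers are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Submitted to the standalone master with, for example:
//   spark-submit --master spark://master:7077 --total-executor-cores 8 app.jar
// --total-executor-cores caps how many worker cores the application may use.

object CoresVsPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cores-vs-partitions"))

    // The partition count is a property of the data/RDD, not of the cluster:
    val rdd = sc.textFile("hdfs:///data/files/*")   // partitions follow the input splits
    println(rdd.getNumPartitions)

    val reshaped = rdd.repartition(100)             // explicitly ask for 100 partitions
    println(reshaped.getNumPartitions)              // 100

    sc.stop()
  }
}
```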

hasan.alkhatib
    Possible duplicate of [What are workers, executors, cores in Spark Standalone cluster?](http://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster) – arglee Nov 21 '16 at 20:56

2 Answers


A simplified view: partitions vs. number of cores

When you invoke an action on an RDD,

  • A "Job" is created for it. So, Job is a work submitted to spark.
  • Jobs are divided in to "STAGE" based on the shuffle boundary!!!
  • Each stage is further divided to tasks based on the number of partitions on the RDD. So Task is smallest unit of work for spark.
  • Now, how many of these tasks can be executed simultaneously depends on the "Number of Cores" available!!!
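
A minimal sketch of that breakdown, assuming a word-count job on an HDFS file (the path and the operations are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageTask {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-stage-task"))

    // The partition count of this RDD follows the input splits (~128 MB each).
    val pairs = sc.textFile("hdfs:///data/words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey needs a shuffle, so the job below runs in two stages:
    //   stage 1: read + flatMap + map  -> one task per input partition
    //   stage 2: the reduce side       -> one task per post-shuffle partition
    val counts = pairs.reduceByKey(_ + _)

    // collect() is the action: calling it submits one job to the cluster.
    // At any instant, at most <number of available cores> tasks run in parallel.
    counts.collect().take(10).foreach(println)

    sc.stop()
  }
}
```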
rakesh

A partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and split into 128 MB chunks (the Spark default), you get roughly 1,600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; up to about 1,600 cores (capped at the number of partitions/tasks) could work on this RDD in parallel.
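
As a rough sketch of that arithmetic (the file size and the core count here are assumptions, not output from a real cluster):

```scala
// Back-of-the-envelope arithmetic only; no Spark cluster needed to run this.
object PartitionMath {
  def main(args: Array[String]): Unit = {
    val fileSizeMB  = 200L * 1024            // a ~200 GB HDFS file
    val splitSizeMB = 128L                   // default block / split size
    val partitions  = math.ceil(fileSizeMB.toDouble / splitSizeMB).toInt
    println(s"partitions (= tasks per stage): $partitions")              // 1600

    val cores = 50                           // cores granted to the application
    println(s"tasks running at once: ${math.min(cores, partitions)}")    // 50
    println(s"waves needed: ${math.ceil(partitions.toDouble / cores).toInt}") // 32
  }
}
```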

Tim
    So if I have 70 cores and my input parquet has 1000 partitions, that means I can only process 70 tasks at a time, correct? In which case the remaining 930 tasks just sit waiting. However, if I repartition my dataset to just 70 or 100 partitions, I might end up with a very large file size per partition. – thentangler Feb 01 '21 at 01:04
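
A quick sketch of the arithmetic behind that comment (all numbers, including the assumed 500 GB dataset size, are illustrative): tasks beyond the core count are not wasted, they queue up and run in later waves, while repartitioning down to the core count makes each partition proportionally larger.

```scala
// Plain Scala arithmetic; no cluster required.
object WavesVsPartitionSize {
  def main(args: Array[String]): Unit = {
    val cores      = 70
    val partitions = 1000
    val dataGB     = 500.0                   // assumed total dataset size

    // The 1000 tasks run in ceil(1000 / 70) waves across the 70 cores:
    println(s"waves: ${math.ceil(partitions.toDouble / cores).toInt}")          // 15

    // Repartitioning down to the core count makes each partition larger:
    println(f"GB per partition at 1000 partitions: ${dataGB / partitions}%.2f") // 0.50
    println(f"GB per partition at   70 partitions: ${dataGB / cores}%.2f")      // 7.14
  }
}
```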