
Spark documentation: "In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle."
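To make the "all-to-all" shape concrete, here is a toy, in-memory sketch of how reduceByKey's shuffle works: every map-side partition buckets its records by key hash, and every reduce task must read its bucket from all map outputs. This is an illustration of the pattern only, not Spark's actual implementation; all names here are made up.

```python
from collections import defaultdict

def reduce_by_key(partitions, reduce_fn, num_reducers=2):
    """Toy model of reduceByKey's all-to-all shuffle (not real Spark code)."""
    # Map side: each input partition buckets its records by hash(key),
    # one bucket per reduce partition.
    map_outputs = []
    for part in partitions:
        buckets = defaultdict(list)
        for key, value in part:
            buckets[hash(key) % num_reducers].append((key, value))
        map_outputs.append(buckets)

    # Reduce side: each reducer reads its bucket from EVERY map output --
    # this all-to-all read is the shuffle.
    results = {}
    for r in range(num_reducers):
        for buckets in map_outputs:
            for key, value in buckets[r]:
                results[key] = reduce_fn(results[key], value) if key in results else value
    return results

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
print(sorted(reduce_by_key(parts, lambda x, y: x + y).items()))
# [('a', 4), ('b', 2), ('c', 4)]
```

Note that no reducer can finish until it has seen data from every input partition, which is why the shuffle forces a stage boundary.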

Spark documentation: "This typically involves copying data across executors and machines, making the shuffle a complex and costly operation."

My understanding is: "This typically involves copying data across *tasks*, making the shuffle a complex and costly operation."

That is, I think the shuffle copies data across tasks. Is this understanding correct?

thebluephantom

1 Answer


Your understanding is somewhat accurate, although I think the docs were trying to emphasize the most expensive part of the operation: sending data over the network, as opposed to keeping it on a single node, which can run multiple tasks, especially if it has multiple cores.

There are some nuances:

Quote from Cloudera doc: "A stage is a collection of tasks that run the same code, each on a different subset of the data." Generally, the shuffle data is the output of a stage (set of tasks) saved for the next (dependent) stage to be run afterwards.

Spark 2.x docs: "During a shuffle, the Spark executor first writes its own map outputs locally to disk, and then acts as the server for those files when other executors attempt to fetch them." This happens at the end of a stage. Furthermore, the external shuffle service that can serve this data is "a long-running process that runs on each node of your cluster independently of your Spark applications and their executors". In other words, shuffle data is served independently of the tasks (and even the executors) that produced it.

As you mentioned, a shuffle can happen at the end of a stage, for example, if a reduceByKey is involved. However, the output of the tasks is managed outside of the tasks themselves, so tasks do not typically copy data amongst themselves directly. Rather, the data is saved somewhere on the node and fetched later on demand.
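The write-then-fetch pattern described above can be sketched as follows. This is a hypothetical model, not Spark's real internals: map tasks write one local file per reduce partition, finish, and only later do reduce tasks fetch those files. The file-naming scheme and function names are invented for illustration.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def run_map_task(task_id, records, num_reducers, shuffle_dir):
    """Map task: bucket records by key hash and write one local file
    per reduce partition. The task then exits; it never talks to reducers."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    for r in range(num_reducers):
        path = os.path.join(shuffle_dir, f"map{task_id}_reduce{r}.bin")
        with open(path, "wb") as f:
            pickle.dump(buckets[r], f)

def run_reduce_task(reduce_id, num_map_tasks, shuffle_dir):
    """Reduce task (runs later): fetch its bucket file from every map
    task's output and sum values per key."""
    totals = {}
    for m in range(num_map_tasks):
        path = os.path.join(shuffle_dir, f"map{m}_reduce{reduce_id}.bin")
        with open(path, "rb") as f:
            for key, value in pickle.load(f):
                totals[key] = totals.get(key, 0) + value
    return totals

shuffle_dir = tempfile.mkdtemp()
parts = [[("a", 1), ("b", 2)], [("a", 3)]]
for i, part in enumerate(parts):               # "stage 1": all map tasks run
    run_map_task(i, part, num_reducers=2, shuffle_dir=shuffle_dir)
merged = {}
for r in range(2):                             # "stage 2": reducers fetch on demand
    merged.update(run_reduce_task(r, num_map_tasks=2, shuffle_dir=shuffle_dir))
print(sorted(merged.items()))
# [('a', 4), ('b', 2)]
```

The key point the sketch illustrates: the map and reduce tasks never coexist or communicate directly; the files on local disk decouple them, which is also what lets an external shuffle service keep serving the data after the producing executor goes away.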

ELinda