
After some Web research, I am still confused about what exactly happens when an Apache Spark task finishes its transformation on one partition of the data. I understand that a transformation is an in-memory operation that creates a new RDD, but where is this RDD stored? After all, when the task is done, its memory gets released, so the resulting RDD has to be persisted somewhere, right? It has to be passed on to the next task or stage, after all. Could you please point me to some documentation?
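
For concreteness, here is a minimal sketch of the situation I mean (a hypothetical example; the names and numbers are made up): a narrow map followed by a reduceByKey that forces a shuffle between stages.

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd     = sc.parallelize(1 to 8, numSlices = 4) // 4 partitions => 4 tasks
    val mapped  = rdd.map(x => (x % 2, x))              // narrow: same stage
    val reduced = mapped.reduceByKey(_ + _)             // wide: shuffle, new stage

    // Nothing has run yet; transformations are lazy.
    println(reduced.toDebugString)            // prints the lineage with the shuffle boundary
    println(reduced.collect().mkString(", ")) // the action triggers the actual job

    spark.stop()
  }
}
```

When a task in the first stage finishes with its partition, where does its output live before the second stage reads it?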

1 Answer


A task can only belong to one stage and operate on a single partition.

All tasks in a stage must be completed before the stages that follow can start.
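
As a small sketch of that ordering (assuming a local Spark setup; the listener and the names below are illustrative, not taken from the linked docs), you can register a SparkListener and watch the stages complete one after another:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

object StageOrderDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "stage-order-demo")

    // Print each stage as it completes, with its task count.
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        println(s"stage ${info.stageId} done: ${info.numTasks} tasks (${info.name})")
      }
    })

    val counts = sc.parallelize(1 to 100, 4) // 4 partitions => 4 tasks per stage
      .map(x => (x % 10, 1))                 // narrow: stays in stage 0
      .reduceByKey(_ + _)                    // shuffle dependency: starts stage 1

    counts.collect() // all 4 tasks of stage 0 finish before stage 1 starts
    sc.stop()
  }
}
```

You should see stage 0 reported as complete before stage 1, because stage 1's tasks read the shuffle output that stage 0's tasks wrote.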

You can check the link below for more details:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-scheduler-Task.html