Order of Execution in Spark DAG

Question

I want to understand the way a spark Dag is created. Lets suppose I have a Spark driver program which perform 3 spark actions (say writing data on s3).

val df1= spark.read.text("S3://onepath/")
val df2= df1.select(col1,col2)
val df3= spark.read.text("s3://anotherpath/")

df1.write("")
df2.write("")
df3.write("")

I want to understand if spark will always write df1, df2 and df3 in the same order or it can improvise on its own and start writing df1 and df3 in parallel as they are not dependent on each other and then finally write df2 as its dependent on df1.

Take a look to https://stackoverflow.com/questions/49646802/how-to-run-two-spark-jobs-in-parallel-in-standalone-mode — Emiliano Martinez, Oct 18 '21 at 08:18

score 2 · Answer 1 · answered Oct 18 '21 at 08:15

Spark will always write df1 first, then df2 and then df3.

Every spark action (write, count, etc.) triggers a spark job in your spark application. Since your actions are sequential, spark won't start a job unless the previous has finished.

If you want to change this behaviour and make the writes to run concurrently, you can run them in different threads or using any other concurrent execution framework, for example:

Seq(df1, df2, df3).par.foreach(df => df.write(""))

Order of Execution in Spark DAG

1 Answers1