
To the best of my understanding, in Spark a job is submitted whenever an action is called on a Dataset/DataFrame. The job may further be divided into stages and tasks, and I understand how to find the number of stages and tasks. Given below is my small code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local").getOrCreate()

    val df = spark.read.json("/Users/vipulrajan/Downloads/demoStuff/data/rows/*.json")
      .select("user_id", "os", "datetime", "response_time_ms")

    df.show()

    df.groupBy("user_id").count().show()

To the best of my understanding, it should have submitted one job for the read, one for the first show, and one for the second show. The first two assumptions are correct, but the second show submits 5 jobs, and I can't understand why. Below is a screenshot of my UI:

[Screenshot of the Spark UI jobs page]

As you can see, there is job 0 for reading the JSON, job 1 for the first show, and 5 jobs for the second show. Can anyone help me understand what these jobs are in the Spark UI?

Vipul Rajan
  • Spark also submits jobs for some transformations. I think your job IDs 2 and 3 are for the groupBy, and job IDs 4 and 5 for the count, then only 1 job for show(), because show() runs on the master while groupBy is distributed across the executors. Did you have 2 executors? – maxime G Oct 23 '18 at 21:56
  • No, there was only one executor with a single core. – Vipul Rajan Oct 24 '18 at 03:43
  • Possible duplicate of [Why does SparkSession execute twice for one action?](https://stackoverflow.com/q/38924623/10465355) – 10465355 Nov 01 '18 at 22:17

1 Answer


Add something like df.groupBy("user_id").count().explain() to see what is actually going on under the hood of your last show().
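For example, a minimal sketch of how you might inspect the plan (explain() and explain(true) are the standard Dataset API; the exact plan text printed depends on your Spark version):

    // Print only the physical plan of the aggregation; Exchange nodes
    // mark shuffle boundaries, i.e. where stages are split.
    df.groupBy("user_id").count().explain()

    // Print the parsed, analyzed, optimized and physical plans as well.
    df.groupBy("user_id").count().explain(true)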

mddm