
To the best of my understanding, in Spark a job is submitted whenever an action is called on a Dataset/DataFrame. The job may further be divided into stages and tasks, and I understand how to find the number of stages and tasks. Given below is my small code:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local").getOrCreate()

    val df = spark.read.json("/Users/vipulrajan/Downloads/demoStuff/data/rows/*.json")
      .select("user_id", "os", "datetime", "response_time_ms")

    df.show()

    df.groupBy("user_id").count().show()

To the best of my understanding, it should have submitted one job for the read, one for the first show, and one for the second show. The first two assumptions are correct, but the second show submits 5 jobs, and I can't understand why. Below is a screenshot of my UI:

[Screenshot of the Spark UI jobs page]

As you can see, there is job 0 for reading the JSON, job 1 for the first show, and 5 jobs for the second show. Can anyone help me understand what these jobs are in the Spark UI?

Vipul Rajan
  • Spark also submits jobs for some transformations. I think your job IDs 2 and 3 are for the groupBy, and job IDs 4 and 5 for the count, then only 1 job for show(), because show() runs on the master while groupBy is distributed across the executors. Did you have 2 executors? – maxime G Oct 23 '18 at 21:56
  • No, there was only one executor with a single core. – Vipul Rajan Oct 24 '18 at 03:43
  • Possible duplicate of [Why does SparkSession execute twice for one action?](https://stackoverflow.com/q/38924623/10465355) – 10465355 Nov 01 '18 at 22:17

1 Answer


Add something like df.groupBy("user_id").count().explain() to see what is actually going on under the hood of your last show().
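For example, a minimal sketch of how you might inspect the plan (explain() and explain(true) are the standard Dataset API; the exact plan text printed depends on your Spark version):

    // Print only the physical plan of the aggregation; Exchange nodes
    // mark shuffle boundaries, i.e. where stages are split.
    df.groupBy("user_id").count().explain()

    // Print the parsed, analyzed, optimized and physical plans as well.
    df.groupBy("user_id").count().explain(true)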

mddm