Specifically, `pivot` causes Spark to launch extra Jobs under the hood to determine the distinct values to pivot on. You can supply those values explicitly to avoid this, as far as I know, but that is not generally done. `show` and reading S3 paths also cause Spark to generate extra Jobs, as does schema inference by Spark.
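Here is a minimal Scala sketch of the difference; the DataFrame and column names (dept, quarter, revenue) are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("PivotExample").getOrCreate()
import spark.implicits._

// Hypothetical data; the column names are for illustration only.
val df = Seq(
  ("Sales", "Q1", 100), ("Sales", "Q2", 150),
  ("HR",    "Q1",  80), ("HR",    "Q2",  90)
).toDF("dept", "quarter", "revenue")

// Without explicit values: Spark first runs an extra Job to collect
// the distinct values of "quarter" before it can plan the pivot.
val implicitPivot = df.groupBy("dept").pivot("quarter").agg(sum("revenue"))

// With explicit values: that extra distinct-values Job is skipped.
val explicitPivot = df.groupBy("dept")
  .pivot("quarter", Seq("Q1", "Q2"))
  .agg(sum("revenue"))

explicitPivot.show()
```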
Most of the documentation on Actions and Jobs dates from the RDD era. With DataFrames and the Catalyst optimizer, Spark can initiate extra work of its own in order to improve performance.
Also, the Spark UI display is hard for many to follow. Often the name of the Job stays the same, but within it the work done for wide transformations that involve shuffling is split into Stages. `groupBy`, `orderBy`, and `agg` all do their thing based on "shuffle boundaries"; that is simply how Spark works, and your code contains all of these.
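As a rough sketch of where those Stage splits come from (the data and column names here are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, desc}

val spark = SparkSession.builder().appName("ShuffleBoundaries").getOrCreate()
import spark.implicits._

// Hypothetical data; names are for illustration only.
val events = Seq(("alice", "click"), ("bob", "view"), ("alice", "view"))
  .toDF("user", "action")

// groupBy/agg forces one shuffle (hash partitioning by "user"), and
// orderBy forces a second shuffle (range partitioning for the sort),
// so the Job is split into Stages at each of those shuffle boundaries.
val ranked = events
  .groupBy("user")
  .agg(count("*").as("n"))
  .orderBy(desc("n"))

// The physical plan shows an Exchange node at every shuffle boundary.
ranked.explain()
ranked.show()
```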
This Spark: disk I/O on stage boundaries explanation may give some further insight into what is going on in the background: the output of the `groupBy` becomes the input to the `orderBy`, spread over two Stages, as in the sketch above.