Specifically, `pivot` causes Spark to launch extra Jobs under the hood to determine the distinct values to pivot on. You can supply those values explicitly to avoid this, as far as I know, but that is not generally done. `show` and reading S3 paths also cause Spark to generate extra Jobs, as does schema inference by Spark.
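Here is a minimal Scala sketch of the difference; the DataFrame and column names (dept, quarter, revenue) are made up purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("PivotExample").getOrCreate()
import spark.implicits._

// Hypothetical data; the column names are for illustration only.
val df = Seq(
  ("Sales", "Q1", 100), ("Sales", "Q2", 150),
  ("HR",    "Q1",  80), ("HR",    "Q2",  90)
).toDF("dept", "quarter", "revenue")

// Without explicit values: Spark first runs an extra Job to collect
// the distinct values of "quarter" before it can plan the pivot.
val implicitPivot = df.groupBy("dept").pivot("quarter").agg(sum("revenue"))

// With explicit values: that extra distinct-values Job is skipped.
val explicitPivot = df.groupBy("dept")
  .pivot("quarter", Seq("Q1", "Q2"))
  .agg(sum("revenue"))

explicitPivot.show()
```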
Most of the documentation on Actions and Jobs dates from the RDD era. With DataFrames and the Catalyst optimizer, Spark can initiate extra work of its own in order to improve performance.
Also, the Spark UI display is hard for many to follow. Often the name of the Job stays the same, but within it the work done for wide transformations that involve shuffling is split into Stages. `groupBy`, `orderBy`, and `agg` all do their thing based on "shuffle boundaries"; that is simply how Spark works, and your code contains all of these.
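As a rough sketch of where those Stage splits come from (the data and column names here are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, desc}

val spark = SparkSession.builder().appName("ShuffleBoundaries").getOrCreate()
import spark.implicits._

// Hypothetical data; names are for illustration only.
val events = Seq(("alice", "click"), ("bob", "view"), ("alice", "view"))
  .toDF("user", "action")

// groupBy/agg forces one shuffle (hash partitioning by "user"), and
// orderBy forces a second shuffle (range partitioning for the sort),
// so the Job is split into Stages at each of those shuffle boundaries.
val ranked = events
  .groupBy("user")
  .agg(count("*").as("n"))
  .orderBy(desc("n"))

// The physical plan shows an Exchange node at every shuffle boundary.
ranked.explain()
ranked.show()
```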
This Spark: disk I/O on stage boundaries explanation may give some further insight into what is going on in the background: the output of the `groupBy` becomes the input to the `orderBy`, spread over two Stages, as in the sketch above.