ShuffleQueryStage and ReusedQueryStage in Spark SQL query Plans

Question

What does ShuffleQueryStage 20 and ReusedQueryStage 16mean in a Spark SQL Query Plan below ? I have shared a part of the query plan generated for my query.

I am using Spark 2.4.7.

:     +- ReusedQueryStage 16
:        +- BroadcastQueryStage 7
:           +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
:              +- AdaptiveSparkPlan(isFinalPlan=true)
:                 +- *(11) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
:                    +- ShuffleQueryStage 21, true
:                       +- Exchange hashpartitioning(src_clmorigid#21055, 10)
:                          +- *(10) HashAggregate(keys=[src_clmorigid#21055], functions=[], output=[src_clmorigid#21055])
:                             +- *(10) Project [src_clmorigid#21055]
:                                +- *(10) BroadcastHashJoin [tgt_clmorigid#21152], [tgt_clmorigid#20756], Inner, BuildRight
:                                   :- *(10) Project [src_clmorigid#21055, tgt_clmorigid#21152]
:                                   :  +- *(10) Filter (isnotnull(tgt_clmorigid#21152) && isnotnull(src_clmorigid#21055))
:                                   :     +- *(10) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clmorigid#21055,tgt_clmorigid#21152] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://dm_bucket...
:                                   +- ReusedQueryStage 20
:                                      +- BroadcastQueryStage 6
:                                         +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
:                                            +- AdaptiveSparkPlan(isFinalPlan=true)
:                                               +- *(9) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
:                                                  +- ShuffleQueryStage 19, true
:                                                     +- Exchange hashpartitioning(tgt_clmorigid#20756, 10)
:                                                        +- *(8) HashAggregate(keys=[tgt_clmorigid#20756], functions=[], output=[tgt_clmorigid#20756])
:                                                           +- *(8) Project [tgt_clmorigid#20756]
:                                                              +- *(8) Filter ((((isnotnull(tgt_clm_line_type_ind#20783) && isnotnull(src_clm_line_type_ind#20686)) 
:                                                                 +- *(8) FileScan parquet default.vw_exclude_latest_set_frm_clm[src_clm_line_type_ind#20686,tgt_clmorigid#20756,tgt_clm_line_type_ind#20783] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://...PushedFilters: [IsNotNull(tgt_clm_line_type_ind), 
                                                                      +- *(41) Project [vw_clm_base_fact_sk#21807, source_system#21808, eff_date#21809, frst_sales_crtn_dt#21810, clmorigid#21811, ... 59 more fields]
                                                                          +- *(41) FileScan parquet default.vw_to_be_merged_data[vw_clm_base_fact_sk#21807,source_system#21808,eff_date#21809,frst_sales_crtn_dt#21810,... 56 more fields], ...

Happy to provide additional information if required.

score 0 · Answer 1 · answered Dec 15 '22 at 18:09

This stages are connected to AQE (Adaptive Query Execution). If you execute your code wit AQE they should disappear

ShuffleQueryStage - this is added after exchange, its used to materialize results from previous stage (so exchange) to allow AQE to use runtime statistics and reoptimize plan

ReusedQueryStage - this means that this branch of execution is already in your plan and it can be reused. In source code i found a comment which says that order of queries may be not intuitive

I know it may sound weird because AQE was officialy released in Spark 3 but actually it was available from Spark 2.2.x (in a different shape than it works now) so its possible that someone enabled AQE on your cluster with Spark 2.4.7 and you see this stages

I added example with analysis for ShuffleQueryStage in other answer, you may take a look: Other answer

ShuffleQueryStage and ReusedQueryStage in Spark SQL query Plans

1 Answers1