I'm trying to query the MIN(dt) in a table partitioned by the dt column, using the following query in both Spark2 and Spark3:
SELECT MIN(dt) FROM table_name
The table is stored in Parquet format in S3, where each dt value is a separate folder, so this seems like a pretty simple operation. There are about 3,200 days of data.
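For context, the layout on S3 looks roughly like this (bucket name, dates, and file names here are illustrative, one folder per day):

s3://bucket/table_name/dt=2013-01-01/part-00000.parquet
s3://bucket/table_name/dt=2013-01-02/part-00000.parquet
...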
In Spark2, this query completes in about a minute, while in Spark3 it has been running for over an hour and still hasn't finished.
In Spark3, the execution plan is:
AdaptiveSparkPlan (10)
+- == Current Plan ==
   HashAggregate (6)
   +- ShuffleQueryStage (5)
      +- Exchange (4)
         +- * HashAggregate (3)
            +- * ColumnarToRow (2)
               +- Scan parquet table_name (1)
+- == Initial Plan ==
   HashAggregate (9)
   +- Exchange (8)
      +- HashAggregate (7)
         +- Scan parquet table_name (1)
It's confusing to me how this could take so long, since the data is already partitioned by dt: Spark should only need to determine which partitions contain any rows and return the minimum of those.
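If my understanding is correct, the answer should be derivable from the metastore alone, without touching any Parquet files. Something like the following (assuming every partition actually contains rows) returns the same information almost for free:

SHOW PARTITIONS table_name

I also came across the spark.sql.optimizer.metadataOnly config, which is described as answering queries over partition columns from partition metadata instead of a table scan, though I'm not sure whether that's what Spark2 was relying on here:

SET spark.sql.optimizer.metadataOnly=true
SELECT MIN(dt) FROM table_name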