I have a large partitioned Iceberg table whose data is sorted by some columns. I want to scan through filtered parts of that table using Spark and toLocalIterator(), preserving that order.
When my filter condition matches data from a single partition, everything is fine: the rows arrive in the expected order.
The problem appears when the result spans multiple partitions: the partitions come back to me in random order. I could of course add ORDER BY to my SELECT statement, but that triggers an expensive sort, which is completely unnecessary if I could just tell Spark which order to emit the partitions in.
The question is: how do I tell Spark to use that partition order (or some other explicit order)? Or, more broadly: how can I take advantage of the sort-order columns in the Iceberg table's schema?