
I have a large partitioned Iceberg table ordered by some columns. Now I want to scan through some filtered parts of that table using Spark and toLocalIterator(), preserving the order.

When my filter condition selects data from a single partition, everything is OK: the rows are ordered as expected.

The problem happens when the result spans multiple partitions: they come back in arbitrary order. Of course I can add ORDER BY to my select statement, but that triggers an expensive sort, which would be totally unnecessary if I could explicitly specify the order in which the partitions are read.
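To illustrate the behavior I am after, here is a minimal sketch in plain Python (not Spark or Iceberg code). The partition keys and rows are made up, and it assumes that each partition is already sorted internally and that the order of the partition keys is consistent with the row order:

```python
from itertools import chain

# Hypothetical data: each partition is already sorted internally,
# and the partition keys are ordered consistently with the rows.
partitions = {
    "2024-01-02": [(4, "d"), (5, "e")],
    "2024-01-01": [(1, "a"), (2, "b"), (3, "c")],
    "2024-01-03": [(6, "f")],
}


def scan_in_partition_order(parts):
    """Yield rows partition by partition, in ascending partition-key order."""
    # Only the few partition keys are sorted here, never the rows themselves.
    ordered_keys = sorted(parts)
    return chain.from_iterable(parts[key] for key in ordered_keys)


rows = list(scan_in_partition_order(partitions))
```

Under those assumptions the concatenated stream is globally ordered without any row-level sort, which is what I would like Spark to do when feeding toLocalIterator().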

The question is: how can I tell Spark to use that order (or some other order)? Or, more broadly: how can I take advantage of the ordering columns defined for an Iceberg table?
