I have an Iceberg table created with
CREATE TABLE catalog.db.table (a int, b int) USING iceberg
Then I apply a sort order to it:
ALTER TABLE catalog.db.table WRITE ORDERED BY (a, b)
After running the last command, SHOW TBLPROPERTIES catalog.db.table starts showing the write.distribution-mode: range property:
|sort-order             |a ASC NULLS FIRST, b ASC NULLS FIRST|
|write.distribution-mode|range                               |
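For completeness, the same properties can also be checked from PySpark, assuming an active spark session with this catalog configured:

# Show all table properties, including the distribution mode
spark.sql("SHOW TBLPROPERTIES catalog.db.table").show(truncate=False)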
Now I'm writing data into the table:
df = spark.createDataFrame([(i, i * 4) for i in range(100000)], ["a", "b"]).coalesce(1).sortWithinPartitions("a", "b")
df.writeTo("catalog.db.table").append()
I thought this would create a single task in Spark, which would sort all the data in the DataFrame (in fact, it is already sorted at creation) and then insert it into the table as a single file too.
Unfortunately, at write time Spark decides to repartition all the data, which causes a shuffle. I believe this happens because of the write.distribution-mode: range property, which had been set automatically.
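From the Iceberg docs, my understanding is that the property itself can be reset; a minimal sketch, assuming write.distribution-mode also accepts the value none (although I would rather not change the table-wide default for every writer):

# Revert the table-wide default so that writes are no longer range-distributed
spark.sql("ALTER TABLE catalog.db.table SET TBLPROPERTIES ('write.distribution-mode' = 'none')")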
== Physical Plan ==
AppendData (6)
+- * Sort (5)
   +- Exchange (4)  # :(
      +- * Project (3)
         +- Coalesce (2)
            +- * Scan ExistingRDD (1)
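The plan above can be reproduced without actually writing anything, for example by registering the DataFrame as a temp view and explaining the equivalent INSERT (the view name tmp_ab is just an illustration):

# Inspect the write plan without executing the insert
df.createOrReplaceTempView("tmp_ab")
spark.sql("EXPLAIN FORMATTED INSERT INTO catalog.db.table SELECT * FROM tmp_ab").show(truncate=False)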
Is there a way to insert new data while avoiding the unwanted shuffle?
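The closest workaround I can think of is a per-write override; a sketch, assuming the distribution-mode write option from Iceberg's Spark Writes documentation takes precedence over the table property:

# Override the distribution mode for this write only, leaving the table property intact
df.writeTo("catalog.db.table").option("distribution-mode", "none").append()

But I am not sure whether this option is supported in all Iceberg versions, or whether there is a more idiomatic way to do this.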