
How do I use the storage partitioned join feature in Spark 3.3.0? I've tried it out, and my query plan still shows the expensive ColumnarToRow and Exchange steps. My setup is as follows:

  • joining two Iceberg tables, both partitioned by hours(ts), bucket(20, id)
  • join attempted both on a.id = b.id AND a.ts = b.ts and on a.id = b.id alone
  • tables are large, 100+ partitions used, 100+ GB of data to join
  • spark: 3.3.0
  • iceberg: org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.1
  • set my spark session config with spark.sql.sources.v2.bucketing.enabled=true
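To make the setup concrete, here is a minimal sketch of the session configuration involved. The Iceberg-side property name is an assumption based on the Iceberg 1.2.0 release mentioned in the answers below and does not exist in 0.14.1:

```python
from pyspark.sql import SparkSession

# Sketch of an SPJ-enabled session; assumes Spark 3.3+ and an Iceberg
# release that actually implements storage-partitioned joins
# (1.2.0+, per the answers below -- 0.14.1 does not).
spark = (
    SparkSession.builder
    # Spark 3.3 flag letting v2 sources report their partitioning
    .config("spark.sql.sources.v2.bucketing.enabled", "true")
    # Iceberg-side flag (assumed, 1.2.0+) that keeps partition grouping
    # visible to Spark's planner
    .config("spark.sql.iceberg.planning.preserve-data-grouping", "true")
    # Disable broadcast joins so the plan actually exercises SPJ
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)
```

With these set, a join on a.id = b.id AND a.ts = b.ts between two tables with identical partitioning should be able to skip the Exchange step.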

I read through all the docs I could find on the storage-partitioned join feature.

I'm wondering if there are other things I need to configure, if there needs to be something implemented in Iceberg still, or if I've set up something wrong. I'm super excited about this feature. It could really speed up some of our large joins.

James D

3 Answers


Support for storage-partitioned joins (SPJ) has been added to Iceberg in PR #6371 and will be released in 1.2.0. Keep in mind Spark added support for SPJ for v2 sources only in 3.3, so earlier versions can't benefit from this feature.
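For readers arriving after the release, a sketch of picking up the fix from the command line. The artifact coordinate follows the pattern in the question with the version bumped to 1.2.0, and the Iceberg property name is an assumption based on the 1.2.0 release; verify both against Maven Central and the Iceberg docs for your versions:

```shell
# Sketch: launching with an SPJ-capable Iceberg release (1.2.0+).
spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0 \
  --conf spark.sql.sources.v2.bucketing.enabled=true \
  --conf spark.sql.iceberg.planning.preserve-data-grouping=true \
  --conf spark.sql.autoBroadcastJoinThreshold=-1
```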

Anton Okolnychyi

The support hasn't been implemented in Iceberg yet. In fact, it looks like the work is in progress as I'm typing: https://github.com/apache/iceberg/issues/430#issuecomment-1283014666

This answer should be updated when there's a release of Iceberg that supports Spark storage-partitioned joins.

gandaliter

The PR making storage-partitioned joins available in Apache Iceberg via Spark is now merged: https://github.com/apache/iceberg/pull/6371