As per Spark docs, spark.sql.shuffle.partitions
parameter Configures the number of partitions to use when shuffling data for joins or aggregations.
. To control the number of output files use the repartition()
method before writing the output. So something like this:
df
.filter(...) // some transformations
.join(...)
.repartition(1) // move data into a single partition
.write
.format(...)
.save(...)
The snippet above would result in a single output file.
You are not limited to repartitioning your data once - you can repartition as much as you need, but bare in mind that this is a costly operation:
df
.filter(...) // some transformations
.repartition(...) // repartition to improve join performance
.join(...)
.repartition(1) // move data into a single partition
.write
.format(...)
.save(...)
If you want a good explanation of how repartition
works, here is a great answer:
Spark - repartition() vs coalesce()
For more information on how to improve the performance of the joins, refer to the Spark docs:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints-for-sql-queries