1

What exactly does spark.sql.shuffle.partitions refer to? Are we talking of the number of partitions that is the results of a wide transformation, or something that happens in the middle as in some sort of intermediary partitioning before the result partition of the wide transformation?

Because in my understanding, as per a wide transformation we have

Parents RDDs -> shuffle files -> Child RDDs

What does the spark.sql.shuffle.partitions parameter refer to here? The shuffles files or the CHILD RDDs or something else that I ignored?

ZygD
  • 22,092
  • 39
  • 79
  • 102
MaatDeamon
  • 9,532
  • 9
  • 60
  • 127

1 Answers1

1

This is already explained in the official docs:

spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations.

In other words it is the number of partitions of the child Dataset.

vinsce
  • 1,271
  • 1
  • 10
  • 19