
I have read in several places that transformations that include a shuffle stage should be avoided when possible, since shuffling involves sending data over the network between nodes, which can have a high performance cost.

I was looking for a list of Spark transformations that might cause shuffling on Spark 2.4+ DataFrames, and all I came up with is this question regarding the old RDD API.

David Taub

1 Answer


Here is a list of transformations from the DataFrame API (current as of PySpark 2.4.4; the corresponding functions exist in the Scala API as well) which may in general induce a shuffle. Whether a shuffle actually happens is not guaranteed; it depends on how your data was prepared, e.g. bucketed or partitioned by a previous transformation (a short PySpark sketch after the list shows how to spot the resulting exchange in the physical plan):

  • join (if planned as SortMergeJoin)
  • data deduplication using distinct / dropDuplicates
  • aggregation using groupBy
  • aggregation with window functions using Window.partitionBy()
  • explicit data repartition using repartition / repartitionByRange function
  • global sorting using orderBy / sort transformation
  • subtracting two dataframes using subtract
  • counting distinct values in a column using countDistinct
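A minimal PySpark sketch of how to check this for yourself: run `explain()` on the transformed DataFrame and look for an `Exchange` node in the physical plan, which marks a shuffle. The toy DataFrame and column names below are made up for illustration; only a few of the transformations from the list are shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").appName("shuffle-check").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 20), (3, "a", 30)],
    ["id", "key", "value"],
)

# groupBy aggregation -- typically planned with an Exchange (hash partitioning on "key")
df.groupBy("key").agg(F.sum("value").alias("total")).explain()

# distinct -- deduplication is implemented as an aggregation, so it also shuffles
df.distinct().explain()

# global sort -- planned with an Exchange using range partitioning
df.orderBy("value").explain()

# window function with partitionBy -- shuffles to co-locate rows sharing a partition key
w = Window.partitionBy("key").orderBy("value")
df.withColumn("rank", F.row_number().over(w)).explain()

spark.stop()
```

If the data were already bucketed or partitioned on the relevant columns, some of these plans could avoid the exchange, which is why the list above says "may" induce a shuffle.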
David Vrba
  • Does Spark also do an internal `repartition` based on the `spark.sql.shuffle.partitions` property, and does that cause a shuffle (exchange)? I have seen an Exchange in the DAG in my code without any of the transformations from the list above. – nir Jul 06 '23 at 20:26