
I have read in several places that transformations that include a shuffle stage should be avoided when possible, since shuffling involves sending data over the network between nodes, which can have a high performance cost.

I was looking for a list of Spark transformations that might cause shuffling on Spark 2.4+ DataFrames, and all I came up with is this question regarding the old RDD API.

David Taub

1 Answer


Here is a list of transformations from the DataFrame API (current as of PySpark 2.4.4; the corresponding functions exist in the Scala API as well) which may in general induce a shuffle. Whether a shuffle actually happens is not guaranteed; it depends on how your data was prepared, e.g. bucketed or partitioned by a previous transformation (a short PySpark sketch after the list shows how to spot the resulting exchange in the physical plan):

  • join (if planned as SortMergeJoin)
  • data deduplication using distinct / dropDuplicates
  • aggregation using groupBy
  • aggregation with window functions using Window.partitionBy()
  • explicit data repartition using repartition / repartitionByRange function
  • global sorting using orderBy / sort transformation
  • subtracting two dataframes using subtract
  • counting distinct values in a column using countDistinct
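A minimal PySpark sketch of how to check this for yourself: run `explain()` on the transformed DataFrame and look for an `Exchange` node in the physical plan, which marks a shuffle. The toy DataFrame and column names below are made up for illustration; only a few of the transformations from the list are shown.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").appName("shuffle-check").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 20), (3, "a", 30)],
    ["id", "key", "value"],
)

# groupBy aggregation -- typically planned with an Exchange (hash partitioning on "key")
df.groupBy("key").agg(F.sum("value").alias("total")).explain()

# distinct -- deduplication is implemented as an aggregation, so it also shuffles
df.distinct().explain()

# global sort -- planned with an Exchange using range partitioning
df.orderBy("value").explain()

# window function with partitionBy -- shuffles to co-locate rows sharing a partition key
w = Window.partitionBy("key").orderBy("value")
df.withColumn("rank", F.row_number().over(w)).explain()

spark.stop()
```

If the data were already bucketed or partitioned on the relevant columns, some of these plans could avoid the exchange, which is why the list above says "may" induce a shuffle.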
David Vrba
  • Does Spark also do an internal `repartition` based on the `spark.sql.shuffle.partitions` property, and does that cause a shuffle (exchange)? I have seen an Exchange in the DAG in my code without any of the transformations from the list above. – nir Jul 06 '23 at 20:26