I have a DataFrame like this:

+---+---+
|_c0|_c1|
+---+---+
|1.0|4.0|
|1.0|4.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|3.0|6.0|
|4.0|5.0|
|4.0|5.0|
|4.0|5.0|
+---+---+

and I would like to shuffle all the rows using Spark in Scala.

How can I do this without going back to RDDs?

1 Answer

You need to use the orderBy method of the DataFrame:

import org.apache.spark.sql.functions.rand

// Sorting by a random value reorders all rows; this triggers a full shuffle across the cluster
val shuffledDF = dataframe.orderBy(rand())
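
If the shuffled order needs to be reproducible, rand also accepts an explicit seed (this overload is standard Spark API; the variable name below is just for illustration):

// Same seed gives the same ordering, given the same input data and partitioning
val reproducibleDF = dataframe.orderBy(rand(42))
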
prudenko
  • How does this scale on very large datasets? Is there a more efficient way? – Chris A. Oct 30 '19 at 00:16
  • @ChrisA. if you want a "true" shuffle, where each row has an equal chance of ending up anywhere in the dataset, then you have to move data across the network. But if you just need to shuffle within each partition, you can use `df.mapPartitions(new scala.util.Random().shuffle(_))` and no network shuffle is involved (see the working sketch after these comments). If a partition holds just one row, though, nothing gets shuffled at all. – prudenko Oct 31 '19 at 12:33
  • @prudenko this `mapPartitions` did not work on a recent version of spark. `error: Unable to find encoder for type org.apache.spark.sql.Row.` – Merlin Nov 29 '20 at 05:43
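
For reference, here is a minimal working sketch of the within-partition approach from the comments. It briefly drops to the RDD API to sidestep the missing Encoder[Row] reported above (so it does touch RDDs, which the question wanted to avoid); shuffleWithinPartitions is a name chosen here for illustration:

import org.apache.spark.sql.DataFrame
import scala.util.Random

// Shuffles rows within each partition only: no data crosses the network,
// and rows never move between partitions.
def shuffleWithinPartitions(df: DataFrame): DataFrame = {
  val shuffled = df.rdd.mapPartitions { rows =>
    Random.shuffle(rows.toSeq).iterator  // materializes one partition in memory
  }
  df.sparkSession.createDataFrame(shuffled, df.schema)
}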