I have a DataFrame like this:

+---+---+
|_c0|_c1|
+---+---+
|1.0|4.0|
|1.0|4.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|2.1|3.0|
|3.0|6.0|
|4.0|5.0|
|4.0|5.0|
|4.0|5.0|
+---+---+

and I would like to shuffle all the rows using Spark in Scala.

How can I do this without going back to RDDs?

1 Answer

You need to use the orderBy method of the DataFrame:

import org.apache.spark.sql.functions.rand

// Sorting by a random value reorders all rows; this triggers a full shuffle across the cluster
val shuffledDF = dataframe.orderBy(rand())
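
If the shuffled order needs to be reproducible, rand also accepts an explicit seed (this overload is standard Spark API; the variable name below is just for illustration):

// Same seed gives the same ordering, given the same input data and partitioning
val reproducibleDF = dataframe.orderBy(rand(42))
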
prudenko
  • How does this scale on very large datasets? Is there a more efficient way? – Chris A. Oct 30 '19 at 00:16
  • @ChrisA. if you want a "true" shuffle, where each row has an equal chance of ending up anywhere in the dataset, then you have to move data across the network. But if you just need to shuffle within each partition, you can use `df.mapPartitions(new scala.util.Random().shuffle(_))` and no network shuffle is involved (see the working sketch after these comments). If a partition holds just one row, though, nothing gets shuffled at all. – prudenko Oct 31 '19 at 12:33
  • @prudenko this `mapPartitions` did not work on a recent version of spark. `error: Unable to find encoder for type org.apache.spark.sql.Row.` – Merlin Nov 29 '20 at 05:43
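
For reference, here is a minimal working sketch of the within-partition approach from the comments. It briefly drops to the RDD API to sidestep the missing Encoder[Row] reported above (so it does touch RDDs, which the question wanted to avoid); shuffleWithinPartitions is a name chosen here for illustration:

import org.apache.spark.sql.DataFrame
import scala.util.Random

// Shuffles rows within each partition only: no data crosses the network,
// and rows never move between partitions.
def shuffleWithinPartitions(df: DataFrame): DataFrame = {
  val shuffled = df.rdd.mapPartitions { rows =>
    Random.shuffle(rows.toSeq).iterator  // materializes one partition in memory
  }
  df.sparkSession.createDataFrame(shuffled, df.schema)
}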