15

I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).

zero323
harshit

2 Answers

26

It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which members end up in the sample, not their order.
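The linear-scan behavior can be illustrated in plain Python: a Bernoulli-style sample walks the data once and keeps each row independently with probability `fraction`, so the survivors stay in their original order (a simplified sketch of the idea, not Spark's actual implementation):

```python
import random

def bernoulli_sample(rows, fraction, seed=42):
    """Linear-scan sampling: keep each row independently with
    probability `fraction`, preserving the original order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

data = list(range(20))
sample = bernoulli_sample(data, fraction=0.5)

# The sample is a subsequence of the input, so its order is untouched.
print(sample == sorted(sample))  # True
```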

You can order DataFrame by a column of random numbers:

from pyspark.sql.functions import rand 

df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)

## +---+
## |  x|
## +---+
## |  2|
## |  7|
## | 14|
## +---+
## only showing top 3 rows

but this approach is:

  • expensive - because it requires a full shuffle, which is something you typically want to avoid.
  • suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and, since a DataFrame doesn't support indexing, it is relatively useless without collecting.
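Under the hood, `orderBy(rand())` is just a sort by freshly generated random keys. A minimal pure-Python sketch of the same technique (in Spark this sort triggers the full shuffle mentioned above; here sorting the keyed pairs is cheap):

```python
import random

def shuffle_by_random_key(rows, seed=None):
    """Attach a random key to every row, then sort by that key --
    the same idea as df.orderBy(rand()) in Spark."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    return [row for _, row in sorted(keyed)]

print(shuffle_by_random_key(list(range(10)), seed=0))
```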
zero323
3

This code works for me without any RDD operations:

import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())

Here is a more elaborate example:

import pandas as pd
import pyspark.sql.functions as F

# Create an example DataFrame
pandas_df = pd.DataFrame([[1, 2], [3, 1], [4, 2], [7, 2], [32, 7], [123, 3]], columns=["id", "col1"])
df = sqlContext.createDataFrame(pandas_df)

df = df.select("*").orderBy(F.rand())

df.show()

+---+----+
| id|col1|
+---+----+
|  1|   2|
|  3|   1|
|  4|   2|
|  7|   2|
| 32|   7|
|123|   3|
+---+----+

df.select("*").orderBy(F.rand()).show()


+---+----+
| id|col1|
+---+----+
|  7|   2|
|123|   3|
|  3|   1|
|  4|   2|
| 32|   7|
|  1|   2|
+---+----+
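Note that every call to `orderBy(F.rand())` produces a different order, as the two outputs above show. If you need the shuffle to be reproducible, `pyspark.sql.functions.rand` accepts a seed (e.g. `F.rand(seed=42)`; the value 42 is arbitrary). The principle, sketched with Python's own RNG rather than Spark:

```python
import random

def seeded_shuffle(rows, seed):
    """Same seed -> same random keys -> same shuffled order."""
    rng = random.Random(seed)
    keyed = sorted((rng.random(), row) for row in rows)
    return [row for _, row in keyed]

a = seeded_shuffle(list(range(6)), seed=42)
b = seeded_shuffle(list(range(6)), seed=42)
print(a == b)  # True: identical seeds reproduce the order
```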
Adrian Mole
Rayanaay