I'm trying to create a new column in a data frame that is simply a shuffled version of an existing column. I can randomly order the rows of a data frame using the method described in "How to shuffle the rows in a Spark dataframe?", but when I try to add the shuffled version of the column back to the original data frame, the shuffling does not appear to happen.
import pyspark
import pyspark.sql.functions as F
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.range(5).toDF("x")
df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 1|
#> | 2|
#> | 3|
#> | 4|
#> +---+
# the rows appear to be shuffled
ordered_df = df.orderBy(F.rand())
ordered_df.show()
#> +---+
#> | x|
#> +---+
#> | 0|
#> | 2|
#> | 3|
#> | 4|
#> | 1|
#> +---+
# ...but when I try to add this column to the df, the values are no longer shuffled
df.withColumn('y', ordered_df.x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
Created on 2019-06-28 by the reprexpy package
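My best guess at what's happening (I haven't verified this against Spark's internals): ordered_df.x is a lazy column expression rather than materialized data, so when it is resolved inside withColumn() it points back at df's own x column and the orderBy() is dropped from the plan.
print(ordered_df.x)                          # just a Column reference, no data
df.withColumn('y', ordered_df.x).explain()   # the plan shows no sort feeding y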
A few notes:
- I would like a solution where the data stays in Spark. For example, I don't want to use a user-defined function that requires moving the data out of the JVM (see the driver-side sketch after these notes for the kind of workaround I'm trying to avoid).
- The solution in "PySpark: Randomize rows in dataframe" didn't work for me (see below):
df = spark.sparkContext.parallelize(range(5)).map(lambda x: (x, )).toDF(["x"])
df.withColumn('y', df.orderBy(F.rand()).x).show()
#> +---+---+
#> | x| y|
#> +---+---+
#> | 0| 0|
#> | 1| 1|
#> | 2| 2|
#> | 3| 3|
#> | 4| 4|
#> +---+---+
- I have many columns to shuffle, and each column has to be shuffled independently of the others. As such, I would prefer not to use the zipWithIndex() solution in https://stackoverflow.com/a/45889539, since that approach would require one join per shuffled column (which I'm assuming will be time-intensive); see the sketch below.
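For reference, this is the kind of out-of-JVM workaround I mean in the first note (it collects to the driver rather than using a UDF, but the data leaves the JVM either way, so it won't scale):
import random

xs = [row.x for row in df.collect()]   # data leaves the JVM here
ys = xs[:]
random.shuffle(ys)                     # shuffle a copy of the column in plain Python
spark.createDataFrame(list(zip(xs, ys)), ['x', 'y']).show()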
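And here is the zipWithIndex() approach from the linked answer as I understand it (my with_index() helper is just for illustration). Shuffling a single column already needs a join on a generated row index, so doing this for many columns means many joins:
def with_index(sdf, index_name):
    # attach a sequential row index via the RDD API
    return (sdf.rdd
               .zipWithIndex()
               .map(lambda pair: (*pair[0], pair[1]))
               .toDF(sdf.columns + [index_name]))

indexed = with_index(df, 'idx')
shuffled = with_index(df.select(F.col('x').alias('y')).orderBy(F.rand()), 'idx')
indexed.join(shuffled, on='idx').drop('idx').show()  # one join per shuffled column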