
I'm a beginner in programming with PySpark. I have the following data in a CSV file, which is read into a Spark DataFrame, and I would like to generate a large dataset starting from this small one.

[screenshot: sample data with columns 'InvoiceNo', 'StockCode' and 'Description']

# read the csv file in a spark dataframe
df = (spark.read
       .option("inferSchema", "true")
       .option("header", "true")
       .csv(file_path))

I want to shuffle the data in each of the columns, i.e. 'InvoiceNo', 'StockCode' and 'Description' respectively, as shown in the snapshot below.

[screenshot: desired output with each column shuffled independently]

I implemented the code below to order the column values randomly:

from pyspark.sql.functions import *

df.orderBy("InvoiceNo", rand()).show(10)

I'm not getting the correct output even after executing the above. Can anyone help me solve the problem? I also referred to this link: Randomly shuffle column in Spark RDD or dataframe, but the code mentioned there throws an error.


1 Answer


The PySpark rand function can be used to add a column of random values to your DataFrame. The DataFrame can then be ordered by the new column to produce the randomised order, e.g.

from pyspark.sql.functions import rand

df.withColumn('rand', rand(seed=42)).orderBy('rand')
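If you want the shuffled DataFrame without the helper column, drop it after ordering, e.g.

shuffled_df = df.withColumn('rand', rand(seed=42)).orderBy('rand').drop('rand')
shuffled_df.show(10)

Note that this reorders whole rows together; each row keeps its original values.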

To randomise the order of each column individually, create a DataFrame for each column separately and randomise each one with a unique seed (seed_1 and seed_2 below are placeholders for integer seeds of your choice), e.g.

col_1_df = df.select('col_1').withColumn('rand', rand(seed=seed_1)).orderBy('rand')
col_2_df = df.select('col_2').withColumn('rand', rand(seed=seed_2)).orderBy('rand')

To recompose a DataFrame with the original columns, you could add a row number to each and then join on it, e.g.

from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

# ordering the window by a literal numbers the rows in their current order
window = Window().orderBy(lit('A'))
col_1_with_row_num = col_1_df.withColumn("row_num", row_number().over(window))
col_2_with_row_num = col_2_df.withColumn("row_num", row_number().over(window))

col_1_with_row_num.join(col_2_with_row_num, on=['row_num']).select('col_1', 'col_2').show()
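
Putting it together for the question's columns, here is a minimal end-to-end sketch (the seed values are arbitrary and the shuffle_column helper is just for illustration):

from pyspark.sql.functions import lit, rand, row_number
from pyspark.sql.window import Window

window = Window().orderBy(lit('A'))

def shuffle_column(df, col_name, seed):
    # shuffle a single column independently and tag each row with a row number
    return (df.select(col_name)
              .withColumn('rand', rand(seed=seed))
              .orderBy('rand')
              .withColumn('row_num', row_number().over(window))
              .drop('rand'))

invoice_df = shuffle_column(df, 'InvoiceNo', seed=1)
stock_df = shuffle_column(df, 'StockCode', seed=2)
desc_df = shuffle_column(df, 'Description', seed=3)

shuffled = (invoice_df
            .join(stock_df, on=['row_num'])
            .join(desc_df, on=['row_num'])
            .select('InvoiceNo', 'StockCode', 'Description'))

shuffled.show(10)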
  • Explanation added. – Chris May 11 '20 at 09:09
  • This is not giving the correct solution. The randomization is getting done on the dataframe row object and not on separate dataframe columns which is the intended goal. – user39602 May 11 '20 at 09:37
  • Individual column shuffle and recombination added. – Chris May 11 '20 at 10:53
  • Is there a specific need to add a row_num column? Can we not do it without one? – user39602 May 11 '20 at 11:10
  • There may be other ways to recombine the dataframes, using a row_num column is one way. – Chris May 11 '20 at 11:19
  • I think the question would benefit from explaining why you want to do what seems like an unusual thing, i.e. break the integrity of each row by shuffling each column like a slot machine. – Chris May 11 '20 at 14:29
  • I'm looking for a way to generate a large dataset from the small dataset I have in hand. – user39602 May 11 '20 at 14:32
  • I recommend you explain that in your question, for others that find it. – Chris May 11 '20 at 14:54
  • Done, although I did mention at the beginning of the question that I'm new to Spark concepts. – user39602 May 11 '20 at 15:00