
I'm a beginner in programming with PySpark. I have the following data in a CSV file, which is read into a Spark DataFrame, and I would like to generate a large dataset starting from this small one.

[screenshot: sample data with columns 'InvoiceNo', 'StockCode' and 'Description']

# read the csv file in a spark dataframe
df = (spark.read
       .option("inferSchema", "true")
       .option("header", "true")
       .csv(file_path))

I want to shuffle the data in each of the columns, i.e. 'InvoiceNo', 'StockCode' and 'Description' respectively, as shown in the snapshot below.

[screenshot: desired output with each column shuffled independently]

I implemented the code below to order the column values randomly:

from pyspark.sql.functions import *

df.orderBy("InvoiceNo", rand()).show(10)

I'm not getting the correct output even after executing the above. Can anyone help me solve the problem? I also referred to this link: Randomly shuffle column in Spark RDD or dataframe, but the code mentioned there throws an error.


1 Answer


The PySpark rand function can be used to add a column of random values to your DataFrame. The DataFrame can then be ordered by the new column to produce the randomised order, e.g.

from pyspark.sql.functions import rand

df.withColumn('rand', rand(seed=42)).orderBy('rand')
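If you want the shuffled DataFrame without the helper column, drop it after ordering, e.g.

shuffled_df = df.withColumn('rand', rand(seed=42)).orderBy('rand').drop('rand')
shuffled_df.show(10)

Note that this reorders whole rows together; each row keeps its original values.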

To randomise the order of each column individually, create a DataFrame for each column separately and randomise each one with a unique seed (seed_1 and seed_2 below are placeholders for integer seeds of your choice), e.g.

col_1_df = df.select('col_1').withColumn('rand', rand(seed=seed_1)).orderBy('rand')
col_2_df = df.select('col_2').withColumn('rand', rand(seed=seed_2)).orderBy('rand')

To recompose a DataFrame with the original columns, you could add a row number to each and then join on it, e.g.

from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

# ordering the window by a literal numbers the rows in their current order
window = Window().orderBy(lit('A'))
col_1_with_row_num = col_1_df.withColumn("row_num", row_number().over(window))
col_2_with_row_num = col_2_df.withColumn("row_num", row_number().over(window))

col_1_with_row_num.join(col_2_with_row_num, on=['row_num']).select('col_1', 'col_2').show()
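
Putting it together for the question's columns, here is a minimal end-to-end sketch (the seed values are arbitrary and the shuffle_column helper is just for illustration):

from pyspark.sql.functions import lit, rand, row_number
from pyspark.sql.window import Window

window = Window().orderBy(lit('A'))

def shuffle_column(df, col_name, seed):
    # shuffle a single column independently and tag each row with a row number
    return (df.select(col_name)
              .withColumn('rand', rand(seed=seed))
              .orderBy('rand')
              .withColumn('row_num', row_number().over(window))
              .drop('rand'))

invoice_df = shuffle_column(df, 'InvoiceNo', seed=1)
stock_df = shuffle_column(df, 'StockCode', seed=2)
desc_df = shuffle_column(df, 'Description', seed=3)

shuffled = (invoice_df
            .join(stock_df, on=['row_num'])
            .join(desc_df, on=['row_num'])
            .select('InvoiceNo', 'StockCode', 'Description'))

shuffled.show(10)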
  • Explanation added. – Chris May 11 '20 at 09:09
  • This is not giving the correct solution. The randomization is getting done on the dataframe row object and not on separate dataframe columns which is the intended goal. – user39602 May 11 '20 at 09:37
  • Individual column shuffle and recombination added. – Chris May 11 '20 at 10:53
  • Is there a specific need to add a row_num column? Can we not do it without one? – user39602 May 11 '20 at 11:10
  • There may be other ways to recombine the dataframes, using a row_num column is one way. – Chris May 11 '20 at 11:19
  • I think the question would benefit from explaining why you want to do what seems like an unusual thing, i.e. break the integrity of each row by shuffling each column like a slot machine. – Chris May 11 '20 at 14:29
  • I'm looking for a way to generate a large dataset from the small dataset I have in hand. – user39602 May 11 '20 at 14:32
  • I recommend you explain that in your question, for others that find it. – Chris May 11 '20 at 14:54
  • Done, although I did mention at the beginning of the question that I'm new to Spark concepts. – user39602 May 11 '20 at 15:00