I'm a beginner in programming with PySpark. I have the following data in a CSV file, which is read into a Spark DataFrame, and I would like to generate a larger dataset starting from this small one.
# read the CSV file into a Spark DataFrame
df = (spark.read
.option("inferSchema", "true")
.option("header", "true")
.csv(file_path))
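For context, this is just how I sanity-check that the file was read correctly (a quick look at the inferred schema and the first few rows, nothing more):

df.printSchema()
df.show(5)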
I want to shuffle the data in each of the columns, i.e. 'InvoiceNo', 'StockCode', and 'Description', independently, as shown in the snapshot below.
The code below was my attempt to order the column values randomly using orderBy:
from pyspark.sql.functions import *
df.orderBy("InvoiceNo", rand()).show(10)
I'm not getting the correct output even after executing the above. Can anyone help me solve the problem? I also referred to this link: Randomly shuffle column in Spark RDD or dataframe, but the code mentioned there throws an error.
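To make the desired output concrete, here is a rough sketch of what I imagine a per-column shuffle could look like. This is only my own guess at an approach (giving each column its own random order via a row_number window over rand() and joining the columns back on a shared row index); the shuffle_columns helper is hypothetical and I haven't verified it, and the column names are just the ones from my data:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def shuffle_columns(df, columns):
    # give every column its own independent random order, then join the
    # columns back together on a shared row index
    result = None
    for col_name in columns:
        w = Window.orderBy(F.rand())
        col_df = (df.select(col_name)
                    .withColumn("_row_id", F.row_number().over(w)))
        result = col_df if result is None else result.join(col_df, "_row_id")
    return result.drop("_row_id")

shuffled_df = shuffle_columns(df, ["InvoiceNo", "StockCode", "Description"])
shuffled_df.show(10)

Is something along these lines the right direction, or is there a simpler / more idiomatic way to shuffle each column independently in PySpark?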