I'm using PySpark SQL functions, version 2.5.4. I have the following data in a pyspark.sql.DataFrame:
df = spark.createDataFrame(
    [
        (302, 'foo'),  # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'input']  # column labels
)
display(df)
What I would like to do is create a separate CSV file for each 'LU' value. The CSVs would look like this:
LU_302.csv
LU_302 = spark.createDataFrame(
    [
        (302, 'foo'),  # values
    ],
    ['LU', 'input']  # column labels
)
LU_203.csv
LU_203 = spark.createDataFrame(
    [
        (203, 'bar'),  # values
    ],
    ['LU', 'input']  # column labels
)
LU_202.csv
LU_202 = spark.createDataFrame(
    [
        (202, 'foo'),  # values
        (202, 'bar'),  # values
    ],
    ['LU', 'input']  # column labels
)
LU_172.csv
LU_172 = spark.createDataFrame(
    [
        (172, 'xxx'),  # values
        (172, 'yyy'),  # values
    ],
    ['LU', 'input']  # column labels
)
My separated dataframes are written here as Spark DataFrames purely for illustration; what I actually want on disk is CSV files.
So you can see the dataframe has been split into separate dataframes on the 'LU' variable. I've been looking at looping over the distinct 'LU' values and writing each subset out as a CSV to a file path, but I can't find a working solution; a rough sketch of the kind of loop I mean is below.
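To be concrete, this is a minimal, untested sketch of what I've been trying (the output path /tmp/output is just a placeholder, and I'm assuming coalesce(1) is the way to force a single part file per 'LU'):

# Sketch: collect the distinct 'LU' values, then filter and write each subset.
lu_values = [row['LU'] for row in df.select('LU').distinct().collect()]

for lu in lu_values:
    (df.filter(df['LU'] == lu)
       .coalesce(1)  # assumption: collapse each subset to a single part file
       .write
       .mode('overwrite')
       .csv('/tmp/output/LU_{}'.format(lu), header=True))

Even with this, Spark seems to write a directory like LU_302/ containing a part-*.csv file rather than a single file named LU_302.csv, so I'm not sure this is the right approach.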
Thanks