
I'm using PySpark SQL functions from version 2.5.4. I have the following data in a pyspark.sql.DataFrame:

df = spark.createDataFrame(
    [
        (302, 'foo'), # values
        (203, 'bar'),
        (202, 'foo'),
        (202, 'bar'),
        (172, 'xxx'),
        (172, 'yyy'),
    ],
    ['LU', 'input'] # column labels
)

display(df)

What I would like to do is create a separate CSV file for each 'LU' value. The CSVs would look like this:

LU_302.csv

LU_302 = spark.createDataFrame(
    [
        (302, 'foo'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_203.csv

LU_203 = spark.createDataFrame(
    [
        (203, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_202.csv

LU_202 = spark.createDataFrame(
    [
        (202, 'foo'), # values
        (202, 'bar'), # values
    ],
    ['LU', 'input'] # column labels
)

LU_172.csv

LU_172 = spark.createDataFrame(
    [
        (172, 'xxx'), # values
        (172, 'yyy'), # values
    ],
    ['LU', 'input'] # column labels
)

The separated dataframes shown here are Spark dataframes for illustration only; what I actually want is each one written out as a CSV file.

So you can see the dataframe has been split into separate dataframes on the 'LU' column. I've been looking into doing this with a loop that runs over the dataframe and writes a new CSV to a file path, but I can't find a working solution (a rough sketch of what I had in mind is below).
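A minimal sketch of that loop idea, assuming the set of distinct 'LU' values is small enough to collect to the driver (the output path is a placeholder):

# Collect the distinct 'LU' values to the driver (assumes there are few of them).
lu_values = [row['LU'] for row in df.select('LU').distinct().collect()]

for lu in lu_values:
    # Filter down to one 'LU' value and write it out with a single part file.
    # Note: .csv() writes a *directory* of part files, not a single flat file.
    (df.filter(df['LU'] == lu)
       .coalesce(1)
       .write
       .option('header', 'true')
       .csv('/tmp/output/LU_{}'.format(lu)))  # placeholder path

A loop like this launches one Spark job per 'LU' value, so the writes run serially, which is why the comment below points toward the built-in partitioning functions instead.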

Thanks

Mrmoleje
  • https://stackoverflow.com/questions/60048027/how-to-manage-physical-data-placement-of-a-dataframe-across-the-cluster-with-pys/60048672#60048672 should help. You don't use loops in Spark; you use the built-in functions to do the work in parallel. – murtihash Apr 07 '20 at 15:31

1 Answer


You can save the dataframe by partitioning on the 'LU' column when you write it out, like:

df.coalesce(1).write.partitionBy('LU').format('csv').option('header', 'true').save(file_path)  # file_path: output directory (string)
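Note that partitionBy writes one subdirectory per distinct value (e.g. file_path/LU=302/part-....csv) rather than flat files named LU_302.csv; with coalesce(1), each of those directories holds a single part file. If the exact LU_<value>.csv names matter, a rough post-processing sketch along these lines is possible (assumptions: it reaches the Hadoop FileSystem API through Spark's private _jvm/_jsc gateway attributes, and file_path is the same output directory as above):

# Rename each partition's single part file to the flat LU_<value>.csv
# naming from the question, using the JVM Hadoop FileSystem API via py4j.
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())

for status in fs.listStatus(hadoop.fs.Path(file_path)):
    name = status.getPath().getName()
    if status.isDirectory() and name.startswith('LU='):
        lu_value = name.split('=', 1)[1]
        for part in fs.listStatus(status.getPath()):
            if part.getPath().getName().startswith('part-'):
                fs.rename(part.getPath(),
                          hadoop.fs.Path(file_path + '/LU_' + lu_value + '.csv'))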
Rahul