
Assume I have a dataframe like:

client_id,report_date,date,value_1,value_2
1,2019-01-01,2019-01-01,1,2
1,2019-01-01,2019-01-02,3,4
1,2019-01-01,2019-01-03,5,6
2,2019-01-01,2019-01-01,1,2
2,2019-01-01,2019-01-02,3,4
2,2019-01-01,2019-01-03,5,6

My desired output structure is a directory tree of CSV (or JSON) files:

results/
   client_id=1/
      report_date=2019-01-01
        <<somename>>.csv
   client_id=2/
      report_date=2019-01-01
        <<somename>>.csv

To achieve this I use:

df.repartition(2, "client_id", "report_date")
  .sortWithinPartitions("date", "value_1")
  .write.partitionBy("client_id", "report_date")
  .csv(...)

However, instead of the desired single file per client_id/report_date partition directory, I end up with two files in each.

The question "Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?" explains why this happens. Using repartition(1) would work, but if the number of client_id values is large, funneling all data through a single partition could run into an OOM. Is there still a way to achieve the desired result? The data per client_id is small.
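
A possible workaround (a minimal sketch, not taken from the linked question; the input path, session setup, and inferSchema option are illustrative assumptions): repartition by the partition columns themselves, without an explicit partition count. Each distinct (client_id, report_date) key then hashes into exactly one shuffle partition, so each output directory is written by exactly one task and gets a single file, while no task has to hold more than its own keys' data, avoiding the repartition(1) OOM risk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("one-file-per-partition")  // illustrative app name
  .getOrCreate()

// Hypothetical input matching the sample schema above.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input.csv")

df
  // Hash-partition by the same columns used in partitionBy below: each
  // distinct (client_id, report_date) key lands in exactly one shuffle
  // partition, so each output directory is written by one task -> one file.
  .repartition(col("client_id"), col("report_date"))
  .sortWithinPartitions("date", "value_1")
  .write
  .partitionBy("client_id", "report_date")
  .csv("results/")

Note that the file writer may apply its own sort on the partition columns before writing, so whether the date/value_1 ordering inside each file is preserved can depend on the Spark version; the output should be verified.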

Georg Heiler
