Assume I have a dataframe like:
client_id,report_date,date,value_1,value_2
1,2019-01-01,2019-01-01,1,2
1,2019-01-01,2019-01-02,3,4
1,2019-01-01,2019-01-03,5,6
2,2019-01-01,2019-01-01,1,2
2,2019-01-01,2019-01-02,3,4
2,2019-01-01,2019-01-03,5,6
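For context, this is roughly how I load the data (a minimal PySpark sketch; the input path is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path; the sample rows above stored as a CSV with a header row.
df = spark.read.csv("input/sample.csv", header=True, inferSchema=True)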
My desired output is one CSV (or JSON) file per partition, laid out like:
results/
    client_id=1/
        report_date=2019-01-01/
            <<somename>>.csv
    client_id=2/
        report_date=2019-01-01/
            <<somename>>.csv
To achieve this I use:
df.repartition(2, "client_id", "report_date")
.sortWithinPartitions("date", "value_1")
.write.partitionBy("client_id", "report_date")
.csv(...)
However, instead of the desired single file per (client_id, report_date) partition, I end up with two files per partition directory. "Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?" explains why this happens.
Using repartition(1) would work, but if the number of distinct client_ids is large, pushing all rows through a single partition could run into OOM. Is there still a way to achieve the desired result? The data per client_id is small.
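For reference, this is the repartition(1) variant I mean (a minimal sketch using the same column names; the output path is just a placeholder):

# Everything is collapsed into a single partition, so each (client_id, report_date)
# directory should get exactly one file, but a single task then has to process
# all rows, which is where the OOM concern comes from.
df.repartition(1) \
  .sortWithinPartitions("date", "value_1") \
  .write.partitionBy("client_id", "report_date") \
  .csv("results/")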