
Assume I have a dataframe like:

client_id,report_date,date,value_1,value_2
1,2019-01-01,2019-01-01,1,2
1,2019-01-01,2019-01-02,3,4
1,2019-01-01,2019-01-03,5,6
2,2019-01-01,2019-01-01,1,2
2,2019-01-01,2019-01-02,3,4
2,2019-01-01,2019-01-03,5,6

My desired output structure is a directory tree of CSV (or JSON) files:

results/
   client_id=1/
      report_date=2019-01-01
        <<somename>>.csv
   client_id=2/
      report_date=2019-01-01
        <<somename>>.csv

To achieve this I use:

df.repartition(2, "client_id", "report_date")
  .sortWithinPartitions("date", "value_1")
  .write.partitionBy("client_id", "report_date")
  .csv(...)

However, instead of the desired single file per client_id/report_date partition directory, I end up with two files in each.

The question "Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?" explains why this happens. Using repartition(1) would work, but if the number of client_id values is large, funneling all data through a single partition could run into an OOM. Is there still a way to achieve the desired result? The data per client_id is small.
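
A possible workaround (a minimal sketch, not taken from the linked question; the input path, session setup, and inferSchema option are illustrative assumptions): repartition by the partition columns themselves, without an explicit partition count. Each distinct (client_id, report_date) key then hashes into exactly one shuffle partition, so each output directory is written by exactly one task and gets a single file, while no task has to hold more than its own keys' data, avoiding the repartition(1) OOM risk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("one-file-per-partition")  // illustrative app name
  .getOrCreate()

// Hypothetical input matching the sample schema above.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("input.csv")

df
  // Hash-partition by the same columns used in partitionBy below: each
  // distinct (client_id, report_date) key lands in exactly one shuffle
  // partition, so each output directory is written by one task -> one file.
  .repartition(col("client_id"), col("report_date"))
  .sortWithinPartitions("date", "value_1")
  .write
  .partitionBy("client_id", "report_date")
  .csv("results/")

Note that the file writer may apply its own sort on the partition columns before writing, so whether the date/value_1 ordering inside each file is preserved can depend on the Spark version; the output should be verified.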

Georg Heiler
