
I wrote a simple program that queries a huge database. To export my result, I wrote this line:

result.coalesce(1).write.options(Map("header" -> "true", "delimiter" -> ";")).csv("mycsv.csv")

I use the coalesce method to get only one file as output. The problem is that the result file contains more than one million lines, so I can't open it in Excel...

So I thought about using a method (or writing my own function with a for loop) that creates partitions based on the number of lines in my file. But I have no idea how I can do this.

My idea is that if I have fewer than one million lines, I will have one partition. If I have more than one million => two partitions, 2 million => 3 partitions, and so on.

Is it possible to do something like this?


1 Answer


You can change the number of partitions depending on the number of rows in the dataframe.

For example:

val rowsPerPartition = 1000000
// count() triggers a Spark job; use the result to compute how many partitions are needed
val partitions = (1 + df.count() / rowsPerPartition).toInt

val df2 = df.repartition(numPartitions = partitions)

Then write the new dataframe to a csv file as before.
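For example, reusing the write options from the question (the output path "mycsv.csv" is just a placeholder):

df2.write
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("mycsv.csv")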

Note: it may be required to use repartition instead of coalesce to make sure the number of rows in each partition are roughly equal, see Spark - repartition() vs coalesce().

    Any idea how to do that without having to call 2 actions, which could be expensive computationally? – Timmeh Oct 23 '20 at 16:27
  • @Timmeh: if you know approximately how large your dataframe is then there is no need for the count. Otherwise, I don't think there is any alternative if you want to adjust the number of partitions depending on the size. – Shaido Oct 24 '20 at 13:45
  • Yeah, unfortunately it's a dynamic job, so the size of the output can vary significantly. I found the maxRecordsPerFile option (sketched below) which might be sufficient – Timmeh Oct 25 '20 at 14:44
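
A minimal sketch of the maxRecordsPerFile approach mentioned in the last comment, assuming the same header/delimiter options and placeholder path as in the question: the option caps the number of rows per output file at write time, so no separate count is needed (each partition may still be split into several files).

df.write
  // cap each output file at one million rows (Spark 2.2+)
  .option("maxRecordsPerFile", 1000000)
  .options(Map("header" -> "true", "delimiter" -> ";"))
  .csv("mycsv.csv")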