I have a dataframe say df1 with 10M rows. I want to split the same to multiple csv files with 1M rows each. Any suggestions to do the same in scala?
Asked
Active
Viewed 2,456 times
1 Answers
1
You can use the randomSplit method on Dataframes.
import scala.util.Random
val df = List(0,1,2,3,4,5,6,7,8,9).toDF
val splitted = df.randomSplit(Array(1,1,1,1,1))
splitted foreach { a => a.write.format("csv").save("path" + Random.nextInt) }
I used the Random.nextInt to have a unique name. You can add some other logic there if necessary.
Source:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
How to save a spark DataFrame as csv on disk?
https://forums.databricks.com/questions/8723/how-can-i-split-a-spark-dataframe-into-n-equal-dat.html
Edit: An alternative approach is to use limit and except:
var input = List(1,2,3,4,5,6,7,8,9).toDF
val limit = 2
var newFrames = List[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]]()
var size = input.count;
while (size > 0) {
newFrames = input.limit(limit) :: newFrames
input = input.except(newFrames.head)
size = size - limit
}
newFrames.foreach(_.show)
The first element in the resulting list may contain less element than the rest of the list.

Steffen Schmitz
- 860
- 3
- 16
- 34
-
@Steffen..My requirement is to have fixed number of rows per csv. Also, the number if records in csv are not fixed. If the master file has 10M rows, 10 csv's of 1M records each should be created. Similarly for 20M records 20 csv's of 1M records should be created. This example does not suffice this problem. – Nitish Apr 24 '17 at 12:33
-
http://stackoverflow.com/questions/41223125/how-to-split-a-spark-dataframe-with-equal-records This provides an example in scala code on how to do this. The number of partitions should be the length of your dataset divided by the number of rows per partition. – Steffen Schmitz Apr 24 '17 at 20:17
-
@Nitish I added an approach that may solve your problem based on an answer to this question: https://stackoverflow.com/questions/44135610/spark-scala-split-dataframe-into-equal-number-of-rows – Steffen Schmitz May 24 '17 at 13:31