
Suppose I had a Spark RDD like the following:

id  | data
----------
1   | "a"
1   | "b"
2   | "c"
3   | "d"

How could I output this to separate JSON text files, grouped by id, such that part-0000-1.json contains rows "a" and "b", part-0000-2.json contains "c", and so on?
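
For reference, a minimal sketch of building this example data as a DataFrame (the session setup and variable names here are my own, not from the question; the answers below use the DataFrame writer API):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// Build the example rows as a DataFrame with columns "id" and "data"
val df = Seq((1, "a"), (1, "b"), (2, "c"), (3, "d")).toDF("id", "data")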

trextomcat

2 Answers

df.write.partitionBy("col").json(<path_to_file>)

is what you need.
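
For illustration, a minimal sketch of what this produces (the column and path names are placeholders, not from the answer):

// Assuming a DataFrame df with a grouping column "id"
df.write
  .partitionBy("id")
  .json("/tmp/output")

// Spark writes one subdirectory per distinct value of the partition column:
//   /tmp/output/id=1/part-00000-<uuid>.json
//   /tmp/output/id=2/part-00000-<uuid>.json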

thebluephantom
  • Thanks for your reply! I've seen this solution before, but `partitionBy()` creates a new directory where it collects separate files, whereas I'm looking to collect all rows into a single file. I hope that clarifies. – trextomcat Dec 08 '18 at 23:27
  • Thanks for putting me back on the right track. I've since found the solution and posted my answer. – trextomcat Dec 08 '18 at 23:41

Thanks to @thebluephantom, I was able to understand what was going wrong.

I was fundamentally misunderstanding Spark. When I was initially doing df.write.partitionBy("col").json(<path_to_file>) as @thebluephantom suggested, I was confused as to why my output was split into many different files.

I have since added .repartition(1) to collect all the data onto a single node, and then .partitionBy("_manual_file_id") to split that data into multiple file outputs. My final code is:

import org.apache.spark.sql.SaveMode

latestUniqueComments
  .repartition(1)                  // collect all data into a single partition
  .write
  .mode(SaveMode.Append)           // append to any existing output
  .partitionBy("_manual_file_id")  // split output by the value of this column
  .format("json")
  .save(outputFile)
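
Note that even with .repartition(1), Spark still writes a directory per partition value rather than a single flat file named like part-0000-1.json; the resulting layout looks roughly like this (illustrative):

outputFile/
  _manual_file_id=1/part-00000-<uuid>.json
  _manual_file_id=2/part-00000-<uuid>.json

The .repartition(1) just guarantees that each of those subdirectories contains a single part file holding all rows for that id.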
trextomcat