
Suppose I had a Spark RDD like the following:

id  | data
----------
1   | "a"
1   | "b"
2   | "c"
3   | "d"

How could I output this to separate JSON text files, grouped by id, such that part-0000-1.json contains rows "a" and "b", part-0000-2.json contains "c", and so on?
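
For reference, a minimal sketch of building this example data as a DataFrame (the session setup and variable names here are my own, not from the question; the answers below use the DataFrame writer API):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._

// Build the example rows as a DataFrame with columns "id" and "data"
val df = Seq((1, "a"), (1, "b"), (2, "c"), (3, "d")).toDF("id", "data")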

trextomcat

2 Answers

df.write.partitionBy("col").json(<path_to_file>)

is what you need.
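
For illustration, a minimal sketch of what this produces (the column and path names are placeholders, not from the answer):

// Assuming a DataFrame df with a grouping column "id"
df.write
  .partitionBy("id")
  .json("/tmp/output")

// Spark writes one subdirectory per distinct value of the partition column:
//   /tmp/output/id=1/part-00000-<uuid>.json
//   /tmp/output/id=2/part-00000-<uuid>.json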

thebluephantom
  • Thanks for your reply! I've seen this solution before, but `partitionBy()` creates a new directory where it collects separate files, whereas I'm looking to collect all rows into a single file. I hope that clarifies. – trextomcat Dec 08 '18 at 23:27
  • Thanks for putting me back on the right track. I've since found the solution and posted my answer. – trextomcat Dec 08 '18 at 23:41

Thanks to @thebluephantom, I was able to understand what was going wrong.

I was fundamentally misunderstanding Spark. When I was initially doing df.write.partitionBy("col").json(<path_to_file>) as @thebluephantom suggested, I was confused as to why my output was split into many different files.

I have since added .repartition(1) to collect all the data onto a single node, and then .partitionBy("_manual_file_id") to split that data into multiple file outputs. My final code is:

import org.apache.spark.sql.SaveMode

latestUniqueComments
  .repartition(1)                  // collect all data into a single partition
  .write
  .mode(SaveMode.Append)           // append to any existing output
  .partitionBy("_manual_file_id")  // split output by the value of this column
  .format("json")
  .save(outputFile)
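
Note that even with .repartition(1), Spark still writes a directory per partition value rather than a single flat file named like part-0000-1.json; the resulting layout looks roughly like this (illustrative):

outputFile/
  _manual_file_id=1/part-00000-<uuid>.json
  _manual_file_id=2/part-00000-<uuid>.json

The .repartition(1) just guarantees that each of those subdirectories contains a single part file holding all rows for that id.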
trextomcat