Thanks to @thebluephantom, I was able to understand what was going wrong.
I was fundamentally misunderstanding how Spark writes data. When I initially ran df.write.partitionBy("col").json(<path_to_file>)
as @thebluephantom suggested, I was confused as to why the output was split into many files: partitionBy("col") creates one directory per distinct value of col, and each directory contains one part file for every partition of the DataFrame that holds rows with that value.
I have since added .repartition(1)
to collapse the DataFrame into a single partition first, so that partitionBy("col")
then writes exactly one file per value of col. My final code is:
latestUniqueComments
.repartition(1)
.write
.mode(SaveMode.Append)
.partitionBy("_manual_file_id")
.format("json")
.save(outputFile)
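A note on scalability (my own addition, not part of the accepted fix): repartition(1) forces every row through a single partition, which can become a bottleneck on large datasets. A sketch of an alternative, assuming the goal is still one file per _manual_file_id, is to repartition by that same column instead. Rows sharing an id then land in the same partition, so each output directory still receives a single part file, while different ids can be written in parallel:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Hash-partition rows by the same column used in partitionBy.
// Each _manual_file_id value maps to exactly one partition, so every
// output directory gets one part file, without collapsing all data
// into a single partition first.
latestUniqueComments
  .repartition(col("_manual_file_id"))
  .write
  .mode(SaveMode.Append)
  .partitionBy("_manual_file_id")
  .format("json")
  .save(outputFile)
```

The trade-off is that several ids may hash to the same partition and be written sequentially by one task, but no single task has to handle the entire dataset.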