I have a val dataset: Dataset[FeedData], where FeedData is something like case class FeedData(feed: String, data: XYZ).
I want to avoid post-processing the files, so I decided to call dataset.repartition($"feed").write.json("s3a://...") so that each feed ends up in a different file. The problem is that the files are still named along the lines of part-XXXX, so I can't easily pick out the relevant file for a given feed without a) opening them all to check the values of feed inside, or b) post-processing the files to give them friendlier names.
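For reference, here is a minimal sketch of the setup described above. The XYZ definition, the sample rows, the local master, and the output path taken from the first program argument are placeholders I made up for illustration; the real job writes to an s3a:// location.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-in for the XYZ payload type, which is not shown in the question.
case class XYZ(value: String)
case class FeedData(feed: String, data: XYZ)

object RepartitionByFeed {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-by-feed")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real dataset.
    val dataset: Dataset[FeedData] = Seq(
      FeedData("feedA", XYZ("a1")),
      FeedData("feedA", XYZ("a2")),
      FeedData("feedB", XYZ("b1"))
    ).toDS()

    // Hypothetical output location supplied by the caller.
    val outputPath = args(0)

    // Shuffle rows by feed, then write JSON lines.
    // The resulting files are named part-XXXX..., with no trace of the feed value.
    dataset.repartition($"feed").write.json(outputPath)

    spark.stop()
  }
}
```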
I want the files to look like part-XXXX-{feed} instead of part-XXXX.
Is it possible to dynamically name the partition files based on the value of the column feed used to partition the dataset?
Background:
I found this answer which mentions a saveAsNewAPIHadoopFile() method, where I can extend some relevant classes for my own file naming implementation.
Can anybody help me understand this method and how to access it from a Dataset, and tell me whether it's possible to project the required information (feed) into my implementation to dynamically name the partitions?