I have a val dataset: Dataset[FeedData], where FeedData is something like case class FeedData(feed: String, data: XYZ).

I want to avoid post-processing the files, so I decided to call dataset.repartition($"feed").write.json("s3a://...") so that each feed ends up in a different file. The problem is that the files are still named along the lines of part-XXXX, so I can't easily pick out the file for a given feed without a) opening them all to check the value of feed inside, or b) post-processing the files into friendlier names.

I want the files to look like part-XXXX-{feed} instead of part-XXXX

Is it possible to dynamically name the partition files based on the value of the column feed used to partition the dataset?
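
For concreteness, here is a minimal sketch of my setup (the fields of XYZ, the sample rows, and the bucket path are placeholders):

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-in for the real payload type
case class XYZ(id: Long, payload: String)
case class FeedData(feed: String, data: XYZ)

val spark = SparkSession.builder().appName("feeds").getOrCreate()
import spark.implicits._

val dataset: Dataset[FeedData] =
  Seq(FeedData("alpha", XYZ(1, "a")), FeedData("beta", XYZ(2, "b"))).toDS()

// The attempted write: output files are still named part-XXXX,
// with nothing in the name indicating which feed they contain.
dataset.repartition($"feed").write.json("s3a://bucket/out")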

Background:

I found this answer, which mentions a saveAsNewAPIHadoopFile() method where I can extend some relevant classes with my own file-naming implementation.

Can anybody help me understand this method, how to access it from a Dataset, and whether it's possible to project the required information (feed) into my implementation so that the partition files can be named dynamically?
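
For reference, here is my understanding of what that kind of extension might look like at the RDD level — a sketch only, assuming the data can be serialized to a String. Note that MultipleTextOutputFormat belongs to the old Hadoop API, so this uses saveAsHadoopFile rather than saveAsNewAPIHadoopFile:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class FeedOutputFormat extends MultipleTextOutputFormat[String, String] {
  // "name" is the default file name (part-XXXXX); append the key to it,
  // giving e.g. part-00000-alpha
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    s"$name-$key"
}

dataset.rdd
  .map(fd => (fd.feed, fd.data.toString))  // placeholder serialization of data
  .saveAsHadoopFile(
    "s3a://bucket/out",
    classOf[String],
    classOf[String],
    classOf[FeedOutputFormat]
  )

Note that TextOutputFormat writes key<TAB>value lines; generateActualKey can also be overridden to drop the key from the file contents.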

Rory Byrne

1 Answer

I was trying to do it the wrong way:

dataset.repartition($"colName").write.format("json").save(path)

The correct way to do this is:

dataset.write.partitionBy("colName").format("json").save(path)

The difference is that .partitionBy is called on the DataFrameWriter (i.e. after .write), not on the Dataset itself. The resulting directories look like: colName=value/part-XXXX.
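
As a sketch of how this fits the feed example above (the path and feed value are placeholders), combining the two calls gives roughly one file per feed, and a feed can then be located purely by path:

// repartition groups each feed into one shuffle partition; partitionBy
// then writes one directory per feed, so each directory typically holds
// a single file.
dataset
  .repartition($"feed")
  .write
  .partitionBy("feed")
  .format("json")
  .save("s3a://bucket/out")

// Read one feed back directly from its directory. Note the feed column
// itself is encoded in the path, not stored inside the files.
val oneFeed = spark.read.json("s3a://bucket/out/feed=alpha")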

See here for more info.

Rory Byrne
  • They actually serve quite different purposes, see https://stackoverflow.com/questions/40416357/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby/40417992 – mazaneicha Dec 03 '19 at 13:46