I have a val dataset: Dataset[FeedData], where FeedData is something like case class FeedData(feed: String, data: XYZ).

I want to avoid post-processing the files, so I decided to call dataset.repartition($"feed").write.json("s3a://...") so that each feed ends up in a different file. The problem is that the files are still named along the lines of part-XXXX, so I can't easily pick out the file for a given feed without a) opening them all to check the value of feed inside, or b) post-processing the files into friendlier names.

I want the files to look like part-XXXX-{feed} instead of part-XXXX

Is it possible to dynamically name the partition files based on the value of the column feed used to partition the dataset?
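
For concreteness, here is a minimal sketch of my setup (the fields of XYZ, the sample rows, and the bucket path are placeholders):

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical stand-in for the real payload type
case class XYZ(id: Long, payload: String)
case class FeedData(feed: String, data: XYZ)

val spark = SparkSession.builder().appName("feeds").getOrCreate()
import spark.implicits._

val dataset: Dataset[FeedData] =
  Seq(FeedData("alpha", XYZ(1, "a")), FeedData("beta", XYZ(2, "b"))).toDS()

// The attempted write: output files are still named part-XXXX,
// with nothing in the name indicating which feed they contain.
dataset.repartition($"feed").write.json("s3a://bucket/out")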

Background:

I found this answer, which mentions a saveAsNewAPIHadoopFile() method where I can extend some relevant classes with my own file-naming implementation.

Can anybody help me understand this method, how to access it from a Dataset, and whether it's possible to project the required information (feed) into my implementation so that the partition files can be named dynamically?
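
For reference, here is my understanding of what that kind of extension might look like at the RDD level — a sketch only, assuming the data can be serialized to a String. Note that MultipleTextOutputFormat belongs to the old Hadoop API, so this uses saveAsHadoopFile rather than saveAsNewAPIHadoopFile:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class FeedOutputFormat extends MultipleTextOutputFormat[String, String] {
  // "name" is the default file name (part-XXXXX); append the key to it,
  // giving e.g. part-00000-alpha
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    s"$name-$key"
}

dataset.rdd
  .map(fd => (fd.feed, fd.data.toString))  // placeholder serialization of data
  .saveAsHadoopFile(
    "s3a://bucket/out",
    classOf[String],
    classOf[String],
    classOf[FeedOutputFormat]
  )

Note that TextOutputFormat writes key<TAB>value lines; generateActualKey can also be overridden to drop the key from the file contents.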

Rory Byrne

1 Answer

I was trying to do it the wrong way:

dataset.repartition($"colName").write.format("json").save(path)

The correct way to do this is:

dataset.write.partitionBy("colName").format("json").save(path)

The difference is that .partitionBy is called on the DataFrameWriter (i.e. after .write), not on the Dataset itself. The resulting directories look like: colName=value/part-XXXX.
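
As a sketch of how this fits the feed example above (the path and feed value are placeholders), combining the two calls gives roughly one file per feed, and a feed can then be located purely by path:

// repartition groups each feed into one shuffle partition; partitionBy
// then writes one directory per feed, so each directory typically holds
// a single file.
dataset
  .repartition($"feed")
  .write
  .partitionBy("feed")
  .format("json")
  .save("s3a://bucket/out")

// Read one feed back directly from its directory. Note the feed column
// itself is encoded in the path, not stored inside the files.
val oneFeed = spark.read.json("s3a://bucket/out/feed=alpha")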

See here for more info.

Rory Byrne
  • They actually serve quite different purposes, see https://stackoverflow.com/questions/40416357/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby/40417992 – mazaneicha Dec 03 '19 at 13:46