I have Scala Spark code that writes a DataFrame to CSV files. The code is shown below:

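import org.apache.spark.sql.functions.col

dataframe
  .select("path", "id", "top_path")
  .repartition(1, col("top_path"))   // collapse to a single partition
  .sortWithinPartitions("path")      // sort rows by path inside that partition
  .write
  .partitionBy("top_path")           // one output directory per top_path value
  .option("delimiter", "\t")         // tab-separated output
  .mode("overwrite")
  .csv(outputPath)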

Even though I call sortWithinPartitions on the path column, some of the output is not sorted as expected. Does anyone know why this happens and how to fix it? I also tried sortWithinPartitions("top_path", "path"), but the written output was still not sorted by path. I expect rows to be sorted by path in ascending order. For example, in some cases I see output like:

path1 1
path1/subpath1 2
path1 3
path1/subpath2 4

instead of

path1 1
path1 3
path1/subpath1 2
path1/subpath2 4
Yanki Twizzy

1 Answer


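My guess is that partitionBy discards any ordering you established before the write. Try applying partitionBy first and then sortBy on the writer. Be aware, though, that DataFrameWriter.sortBy is only supported together with bucketBy and saveAsTable, so it cannot be chained directly in front of .csv(outputPath).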
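
Here is a minimal sketch of one way to test that guess. It rests on an assumption about Spark's behavior that is not confirmed here: that the partitioned writer wants rows ordered by the partition column and may re-sort them otherwise. The sketch therefore makes top_path the leading sort key and then reads one output directory back to check the order. The sample rows, the top1 value, the /tmp output location, and the isAscending helper are all hypothetical. The check may still fail on some Spark versions or configurations, which is why the result is printed rather than assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SortedPartitionedWrite {
  // Hypothetical helper: true when the values are in ascending lexicographic order.
  def isAscending(xs: Seq[String]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => a.compareTo(b) <= 0 }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sorted-partitioned-write")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data mirroring the rows in the question.
    val dataframe = Seq(
      ("path1", 1, "top1"),
      ("path1/subpath1", 2, "top1"),
      ("path1", 3, "top1"),
      ("path1/subpath2", 4, "top1")
    ).toDF("path", "id", "top_path")

    val outputPath = "/tmp/sorted_partitioned_write" // hypothetical location

    dataframe
      .select("path", "id", "top_path")
      .repartition(1, col("top_path"))
      // Assumption: leading with the partition column keeps the writer
      // from needing its own sort on top_path.
      .sortWithinPartitions("top_path", "path")
      .write
      .partitionBy("top_path")
      .option("delimiter", "\t")
      .mode("overwrite")
      .csv(outputPath)

    // Read one partition directory back as raw lines and inspect the path column.
    val lines = spark.read.textFile(s"$outputPath/top_path=top1").collect().toSeq
    val paths = lines.map(_.split("\t")(0))
    println(s"sorted by path: ${isAscending(paths)}")

    spark.stop()
  }
}
```

If the check prints false, the ordering is being lost somewhere between the sort and the write on your setup, and it would be worth comparing the physical plan (for example with explain()) for the sort that actually runs before the files are written.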