Spark repartition, sortWithinPartitions and then partitionBy during write is messing with my sort

Question

I have Scala Spark code that writes a DataFrame to CSV files. The code is shown below:

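import org.apache.spark.sql.functions.col

dataframe
  .select("path", "id", "top_path")
  // hash-partition on top_path into a single partition
  .repartition(1, col("top_path"))
  // sort the rows of each partition by path, ascending
  .sortWithinPartitions("path")
  .write
  // one output directory per distinct top_path value
  .partitionBy("top_path")
  .option("delimiter", "\t")
  .mode("overwrite")
  .csv(outputPath)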

Even though I call sortWithinPartitions on the path column, some of the output is still not sorted as expected. Does anyone know why this happens and how it can be fixed? I also tried sortWithinPartitions("top_path", "path"), but the written output was still not sorted by path. I expect the rows to be sorted in ascending order by path. For example, in some cases I see output like

path1 1
path1/subpath1 2
path1 3
path1/subpath2 4

instead of

path1 1
path1 3
path1/subpath1 2
path1/subpath2 4
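
For reference, the expected ordering above is a plain ascending string sort on path, with rows that share a path left in their original relative order. A minimal plain-Scala sketch (no Spark involved; the object name ExpectedOrder is made up for illustration) reproduces it for the four sample rows:

```scala
object ExpectedOrder {
  // The four sample rows from the question as (path, id) pairs,
  // listed in the unsorted order that was observed.
  val rows: Seq[(String, Int)] = Seq(
    ("path1", 1),
    ("path1/subpath1", 2),
    ("path1", 3),
    ("path1/subpath2", 4)
  )

  // Ascending sort on the path string only. sortBy on a Seq is stable,
  // so rows with the same path keep their original relative order.
  val sorted: Seq[(String, Int)] = rows.sortBy(_._1)

  def main(args: Array[String]): Unit =
    sorted.foreach { case (path, id) => println(s"$path $id") }
}
```

Because "path1" is a prefix of "path1/subpath1", it sorts first, so both path1 rows should come before either subpath row.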

@thebluephantom I've added examples. I hope it's good as it is — Yanki Twizzy, Mar 06 '22 at 15:24
OK, but what are you getting? I think I know, but it would be handy to show expected vs. actual output — thebluephantom, Mar 06 '22 at 15:30
repartition(1, col) means 1 partition per unique col. It makes everything get written into 1 file for each unique value of the partitioned-by col — Yanki Twizzy, Mar 06 '22 at 15:40
Does this answer your question? [Difference between df.repartition and DataFrameWriter partitionBy?](https://stackoverflow.com/questions/40416357/difference-between-df-repartition-and-dataframewriter-partitionby) (The answer with the highest score, not the accepted one.) — mazaneicha, Mar 06 '22 at 20:50

score 0 · Answer 1 · answered Mar 08 '22 at 08:18

My guess would be that partitionBy resets any order that you had before. Try partitionBy followed by sortBy.

Yaroslav Fyodorov

sortBy needs to be used together with bucketBy and I don't want to bucket before writing – Yanki Twizzy Mar 08 '22 at 18:50
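
To illustrate the constraint in this comment, here is a rough sketch of what the "partitionBy then sortBy" suggestion looks like with the DataFrameWriter API. It assumes a local SparkSession, made-up sample rows, and a hypothetical table name paths_sorted; none of these come from the question. As far as I know, sortBy is only accepted together with bucketBy, and a bucketed write has to go through saveAsTable rather than csv(outputPath), so this is not a drop-in replacement for the original code.

```scala
import org.apache.spark.sql.SparkSession

object BucketedSortedWrite {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketed-sorted-write")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows using the question's column names.
    val dataframe = Seq(
      ("path1/subpath1", 2, "path1"),
      ("path1", 1, "path1"),
      ("path1/subpath2", 4, "path1"),
      ("path1", 3, "path1")
    ).toDF("path", "id", "top_path")

    dataframe.write
      .partitionBy("top_path")
      // sortBy requires bucketBy; a single bucket on id is an arbitrary choice here.
      .bucketBy(1, "id")
      .sortBy("path")
      .format("csv")
      .option("delimiter", "\t")
      .mode("overwrite")
      // Bucketed output must be saved as a table; "paths_sorted" is a made-up name.
      .saveAsTable("paths_sorted")

    spark.stop()
  }
}
```

The trade-off is the one raised in the comment: the data ends up bucketed and registered as a table, which may not be wanted when the goal is just sorted CSV files under outputPath.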
