I have Scala Spark code that writes a DataFrame to CSV files. The code is shown below:
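
    import org.apache.spark.sql.functions.col

    dataframe
      .select("path", "id", "top_path")
      // put all rows into a single partition, keyed by top_path
      .repartition(1, col("top_path"))
      // sort rows by path inside that partition
      .sortWithinPartitions("path")
      .write
      // one output directory per distinct top_path value
      .partitionBy("top_path")
      .option("delimiter", "\t")
      .mode("overwrite")
      .csv(outputPath)
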
Even though I am calling sortWithinPartitions on the path column, some of the output is still not sorted as expected. Does anyone know why this is happening and how it can be fixed? I have tried sortWithinPartitions("top_path", "path"), but the output was still not sorted by path properly when written. I expect the rows to be sorted in ascending order by path. For example, in some cases I am seeing output like

    path1 1
    path1/subpath1 2
    path1 3
    path1/subpath2 4

instead of

    path1 1
    path1 3
    path1/subpath1 2
    path1/subpath2 4
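
To make the expected ordering concrete, here is a minimal plain-Scala sketch (no Spark involved) that sorts the four sample rows above by path. The object name ExpectedOrder and the values rows and expected are only illustrative. It shows the order I am hoping for, not a fix for the Spark job.

```scala
// Illustrative only: plain Scala collections, no Spark.
object ExpectedOrder {
  // (path, id) pairs in the order they currently appear in the output file.
  val rows: Seq[(String, Int)] = Seq(
    ("path1", 1),
    ("path1/subpath1", 2),
    ("path1", 3),
    ("path1/subpath2", 4)
  )

  // Ascending lexicographic sort on path. sortBy is stable, so rows that
  // share the same path keep their relative order.
  val expected: Seq[(String, Int)] = rows.sortBy(_._1)

  def main(args: Array[String]): Unit =
    expected.foreach { case (path, id) => println(s"$path $id") }
}
```

Running it prints the rows in the second order shown above: both path1 rows first, followed by path1/subpath1 and path1/subpath2.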