Scala sort and save csv - creating multiple csv files

Asked Sep 13 '18 at 13:56

Active Sep 13 '18 at 13:57

Task: read a CSV file, add two lower-cased columns, sort, and save the file. Problem: when sorting is applied, the write produces multiple files. Can someone please explain what is happening here?

import org.apache.spark.sql.functions.{col, lower}

var df = spark.read
  .format("csv")
  .option("header", "true")
  .load(i_file)
  .select("Id", "Name", "Address")

// Add lower-cased copies of Name and Address
df = df.withColumn("x_name", lower(col("Name")))
df = df.withColumn("x_address", lower(col("Address")))
df = df.orderBy("x_name") // <--- this line
df.write.option("header", "true").csv(o_file)

If I remove orderBy, only one file is created.

edited Sep 13 '18 at 13:57

Xavier Guihot

asked Sep 13 '18 at 13:56

Eyedia Tech

Hmm, maybe it does not matter; let Spark store these as partitioned files. That is my understanding! – Eyedia Tech Sep 13 '18 at 13:59
Thanks @Dima, that answers my question. Sorry for the duplicate, not sure why I could not find that one! – Eyedia Tech Sep 13 '18 at 15:53
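
For readers landing here without the linked duplicate, a brief sketch of what is likely going on. orderBy is a global sort, so Spark shuffles the rows into range partitions (the count is governed by spark.sql.shuffle.partitions, 200 by default), and the writer typically emits one part file per partition. Without the sort, a small input is presumably read as a single partition, hence the single file. If one sorted file is really needed, one option is to move everything into a single partition and sort inside it. The paths, object name, and local master below are hypothetical stand-ins, not taken from the question, and this assumes the data fits comfortably on one executor.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

object SortedSingleCsv {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths, standing in for i_file and o_file from the question.
    val inputPath = "input.csv"
    val outputDir = "sorted_output"

    val spark = SparkSession.builder()
      .appName("SortedSingleCsv")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load(inputPath)
      .select("Id", "Name", "Address")
      .withColumn("x_name", lower(col("Name")))
      .withColumn("x_address", lower(col("Address")))

    // Put all rows in one partition, then sort within that partition,
    // so the writer produces a single part file in sorted order.
    df.repartition(1)
      .sortWithinPartitions("x_name")
      .write
      .option("header", "true")
      .csv(outputDir)

    spark.stop()
  }
}
```

The output is still a directory containing one part-*.csv file rather than a file with the exact name given. For large data, keeping the multiple part files (as the first comment suggests) is usually the better trade-off, since a single partition removes all write parallelism.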
