
I am very new to Scala and I have a CSV file:

MSH      ModZId  ModProd  Date
1140000  zzz     abc      2/19/2018
1140000  zzz     abc      2/19/2018
651      zzz     abc      2/19/2018
651      zzz     abc      2/19/2018
1140000  zzz     abc      2/19/2018
860000   zzz     mno      2/26/2018
860000   zzz     mno      2/26/2018
122      zzz     mno      2/26/2018
122      zzz     mno      2/26/2018
860000   zzz     mno      2/26/2018
1140000  zzz     pxy      2/19/2018
1140000  zzz     pxy      2/19/2018
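
For context, here is how I am reading the file (a sketch of my setup; I am assuming the file is tab-separated with a header row, and "input/data.csv" is a placeholder path):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitionCsvByDate")
      .getOrCreate()
    import spark.implicits._

    // assuming a tab-separated file with a header row;
    // "input/data.csv" is a placeholder path
    val df = spark.read
      .option("header", "true")
      .option("sep", "\t")
      .csv("input/data.csv")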

I need to partition the CSV file by date and write each partition out as Parquet, like below:

Folder name 2018/02/19

Parquet file 1 output:

MSH      ModZId  ModProd  Date
1140000  zzz     abc      2/19/2018
1140000  zzz     xyz      2/19/2018
651      zzz     def      2/19/2018
651      zzz     ghi      2/19/2018
1140000  zzz     klm      2/19/2018

Parquet file 2 output:

MSH      ModZId  ModProd  Date
1140000  zzz     pxy      2/19/2018
1140000  zzz     pxy      2/19/2018

Folder name 2018/02/26

MSH      ModZId  ModProd  Date
860000   zzz     mno      2/26/2018
860000   zzz     pqr      2/26/2018
122      zzz     stu      2/26/2018
122      zzz     wxy      2/26/2018
860000   zzz     ijk      2/26/2018

I am trying the following, but I am not sure how to iterate over the DataFrame:

    import org.apache.spark.sql.SaveMode

    // distinct (ModProd, Date) pairs, just to inspect what is in the data
    val writeDF = df
      .select($"ModProd", $"Date")
      .distinct()
      .orderBy($"ModProd", $"Date")

    writeDF.show()

    // write one Parquet folder per distinct Date value
    df.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .partitionBy("Date")
      .save(Path)
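From what I have read, partitionBy("Date") creates folders named after the raw column value (for example Date=2%2F19%2F2018 once the slashes are URL-encoded), not 2018/02/19. Would deriving year/month/day columns be the right approach? A sketch of what I mean (assuming to_date can parse the M/d/yyyy strings):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{to_date, year, month, dayofmonth}

    // sketch: derive year/month/day partition columns from the M/d/yyyy string
    val withDateParts = df
      .withColumn("parsed", to_date($"Date", "M/d/yyyy"))
      .withColumn("year", year($"parsed"))
      .withColumn("month", month($"parsed"))
      .withColumn("day", dayofmonth($"parsed"))
      .drop("parsed")

    // produces folders like year=2018/month=2/day=19
    withDateParts.write
      .mode(SaveMode.Overwrite)
      .partitionBy("year", "month", "day")
      .parquet(Path)

I understand the folder names would still carry the column-name prefix (year=.../month=.../day=...); I do not know whether exactly 2018/02/19 is achievable with partitionBy alone.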

Can anyone please help me? I am very new and do not know how I can partition the CSV file in Scala by date.

abssab
  • Maybe this is helpful, [check it](https://stackoverflow.com/questions/37807124/apache-spark-using-folder-structures-to-reduce-run-time-of-analyses) – Lamanus Jul 29 '19 at 12:23
  • 1
    https://stackoverflow.com/questions/36107581/change-output-filename-prefix-for-dataframe-write/36108367#36108367 – koiralo Jul 29 '19 at 14:54

0 Answers