I am very new to Scala and I have a CSV file:

MSH     ModZId  ModProd     Date
1140000 zzz      abc    2/19/2018
1140000 zzz      xyz    2/19/2018
651     zzz      def    2/19/2018
651     zzz      ghi    2/19/2018
1140000 zzz      klm    2/19/2018
860000  zzz      mno    2/26/2018
860000  zzz      pqr    2/26/2018
122     zzz      stu    2/26/2018
122     zzz      wxy    2/26/2018
860000  zzz      ijk    2/26/2018

I need to partition the CSV file by date and write each partition out as Parquet, like below:

Parquet Output 1:

MSH     ModZId  ModProd  Date
1140000 zzz     abc     2/19/2018
1140000 zzz     xyz     2/19/2018
651     zzz     def     2/19/2018
651     zzz     ghi     2/19/2018
1140000 zzz     klm     2/19/2018

Parquet Output 2:

MSH     ModZId  ModProd  Date
860000  zzz     mno     2/26/2018
860000  zzz     pqr     2/26/2018
122     zzz     stu     2/26/2018
122     zzz     wxy     2/26/2018
860000  zzz     ijk     2/26/2018

Can anyone please help me? I am very new to this and do not know how to partition the CSV file by date in Scala.

abssab

1 Answer

If you have already read the CSV file into a DataFrame with the data shown above, you can use partitionBy when writing it as Parquet:

df.write.partitionBy("Date").parquet("outputpath")

This creates one subfolder per distinct Date value under outputpath, each holding the Parquet files for that date.
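
For completeness, here is a minimal end-to-end sketch of the read and write steps. It assumes a comma-separated file with a header row; the object name `CsvToPartitionedParquet`, the helper `writeByDate`, and the paths `input.csv` and `outputpath` are placeholders for illustration, not anything from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CsvToPartitionedParquet {

  // Reads a CSV file that has a header row and writes it as Parquet,
  // with one subfolder per distinct value of the Date column.
  def writeByDate(spark: SparkSession, inputPath: String, outputPath: String): Unit = {
    val df: DataFrame = spark.read
      .option("header", "true") // first row holds MSH, ModZId, ModProd, Date
      .option("sep", ",")       // assumption: comma-separated; change for tabs etc.
      .csv(inputPath)

    df.write
      .partitionBy("Date")      // same call as in the answer above
      .parquet(outputPath)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is only for trying this out on one machine
    val spark = SparkSession.builder()
      .appName("csv-to-partitioned-parquet")
      .master("local[*]")
      .getOrCreate()
    try writeByDate(spark, "input.csv", "outputpath")
    finally spark.stop()
  }
}
```

Spark names the subfolders `Date=<value>`, and it percent-escapes characters such as `/` in the value, so a date like 2/19/2018 will not appear verbatim as a folder name. The Date column is stored in the folder name rather than inside the Parquet files; reading `outputpath` back with Spark restores it as a column.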

koiralo
  • Not really, it is giving me date_2/19/2018 and I need only a 2/19/2018 folder – abssab Jul 29 '19 at 12:08
  • Well, for that you can write a separate script to rename the folders after the Spark job has finished. – koiralo Jul 29 '19 at 12:11
  • Take a look at this one https://stackoverflow.com/questions/36107581/change-output-filename-prefix-for-dataframe-write/36108367#36108367 – koiralo Jul 29 '19 at 14:47
  • I am not able to find any simple method to rename the file from date_2/19/2018 to 2/19/2018. Please suggest one. – abssab Jul 30 '19 at 09:48
  • Yes, Shankar, I was thinking the same: rename the folders. But I am struggling to rename all of them. I am not able to list all the folders and then rename them to the required names – abssab Jul 30 '19 at 10:07
  • Can you please help me if you have anything that lists all the directories and then renames them? – abssab Jul 30 '19 at 10:07
  • No, ext4 is the storage, so I have folders named Date_20190101, Date_20190201, Date_20190301. Can I rename them directly to 20190101, 20190201, 20190301? – abssab Jul 30 '19 at 10:15
  • Here is a simple way in Scala (needs `import java.io.File`), change it as you want: `val outputDir = new File("output folder"); outputDir.listFiles().filter(file => file.isDirectory && file.getName.startsWith("Date_")).foreach { file => val newFileName = file.getName.replace("Date_", ""); file.renameTo(new File(outputDir, newFileName)) }` – koiralo Jul 30 '19 at 10:20
  • Yes, I am trying this on my machine. Can you please help me with one more thing: I have a list of dates (20190101, 20190201) and I need to delete a folder if one exists with the same name. For example, if the folder is /usr/local/20190101 and the list contains 20190101, then I need to delete that folder – abssab Jul 30 '19 at 11:37
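
The rename and delete steps discussed in the comments could be sketched roughly as follows, assuming the folders sit on a local filesystem (ext4, as mentioned above) and not on HDFS. The object name `PartitionFolders` and its three helpers are made up for illustration; the prefix is a parameter because the comments mention `Date_` while Spark itself writes `Date=`.

```scala
import java.io.File

object PartitionFolders {

  // Renames every direct subfolder of `outputDir` whose name starts with
  // `prefix` by dropping that prefix. Returns the new names of the folders
  // that were actually renamed.
  def dropFolderPrefix(outputDir: File, prefix: String): Seq[String] = {
    // listFiles() returns null when outputDir is missing or not a directory
    val children = Option(outputDir.listFiles()).getOrElse(Array.empty[File])
    children.toSeq
      .filter(f => f.isDirectory && f.getName.startsWith(prefix))
      .filter(f => f.getName.length > prefix.length) // skip a folder named exactly `prefix`
      .flatMap { f =>
        val newName = f.getName.stripPrefix(prefix)
        if (f.renameTo(new File(outputDir, newName))) Some(newName) else None
      }
  }

  // Deletes a file, or a folder together with everything inside it.
  def deleteRecursively(f: File): Boolean = {
    if (f.isDirectory)
      Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }

  // For each name in `names`, deletes `baseDir/name` if it exists as a folder.
  // Returns the names that were deleted.
  def deleteIfPresent(baseDir: File, names: Seq[String]): Seq[String] =
    names.filter { n =>
      val dir = new File(baseDir, n)
      dir.isDirectory && deleteRecursively(dir)
    }
}
```

With these helpers, `PartitionFolders.dropFolderPrefix(new File("outputpath"), "Date_")` would turn Date_20190101 into 20190101, and `PartitionFolders.deleteIfPresent(new File("/usr/local"), Seq("20190101", "20190201"))` would remove /usr/local/20190101 if it exists. Deletion is permanent, so it is worth trying this on a scratch directory first.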