I have a huge dataset that accumulates every year, quarter by quarter. The data is somewhat skewed: when I try to load everything into one DataFrame and repartition it by ("year", "quarter"), it shuffles a lot of data and spills to disk, which makes the job slow. On top of that, only one executor is doing the work about 80% of the time.
Hence I decided to: 1) get the distinct (year, quarter) groups from the DataFrame, then 2) iterate over those groups, fetching the rows where year and quarter match the current group, saving that subset as a parquet file, and continuing with the next group.
In Java I could just use a for loop over the groups, but how do I do this in Spark with Scala?
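Here is a minimal sketch of the loop I have in mind, assuming the full dataset is in a DataFrame called `df` with integer `year` and `quarter` columns; the input and output paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("GroupedParquetWrite").getOrCreate()

// Placeholder source; in practice the DataFrame is loaded from the real input.
val df: DataFrame = spark.read.parquet("/path/to/input")

// 1) Collect the distinct (year, quarter) pairs to the driver.
//    This is small: only a handful of groups per year.
val groups = df.select("year", "quarter").distinct().collect()

// 2) For each group, filter the matching rows and write them
//    out as a separate parquet file.
groups.foreach { row =>
  // Assumes both columns are integers; adjust the types if they differ.
  val year = row.getAs[Int]("year")
  val quarter = row.getAs[Int]("quarter")

  df.filter(df("year") === year && df("quarter") === quarter)
    .write
    .mode("overwrite")
    .parquet(s"/path/to/output/year=$year/quarter=$quarter")
}
```

With this approach I would presumably want to cache `df` before the loop, since otherwise each iteration re-reads the source. Is this the right way to do it, or is there a more idiomatic alternative?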